[jira] [Updated] (MESOS-8456) Allocator should allow roles to burst above guarantees but below limits.

2018-01-17 Thread Meng Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu updated MESOS-8456:

Sprint: Mesosphere Sprint 73

> Allocator should allow roles to burst above guarantees but below limits.
> 
>
> Key: MESOS-8456
> URL: https://issues.apache.org/jira/browse/MESOS-8456
> Project: Mesos
> Issue Type: Improvement
> Components: allocation
> Reporter: Meng Zhu
> Assignee: Meng Zhu
> Priority: Major
>
> Currently, the allocator only allocates resources to quota roles up to their 
> guarantee in the first allocation stage. The allocator should continue 
> allocating resources to these roles in the second stage, up to their quota 
> limit. In other words, the allocator should allow roles to burst above their 
> guarantee but below the limit.
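
The two-stage behaviour requested above can be sketched roughly as follows. This is a minimal illustration over a single scalar resource, not the Mesos allocator: the `Role` class and `allocate` function are hypothetical names, and fair sharing between roles and framework demand are deliberately ignored.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Role:
    # Hypothetical model of a quota role (not a Mesos type).
    name: str
    guarantee: float  # amount the role is guaranteed
    limit: float      # upper bound the role may burst to
    allocated: float = 0.0


def allocate(roles: List[Role], available: float) -> float:
    """Allocate in two stages and return what is left over."""
    # Stage 1: satisfy each role's guarantee first.
    for role in roles:
        grant = min(role.guarantee - role.allocated, available)
        if grant > 0:
            role.allocated += grant
            available -= grant

    # Stage 2: let roles burst above the guarantee, never above the limit.
    for role in roles:
        grant = min(role.limit - role.allocated, available)
        if grant > 0:
            role.allocated += grant
            available -= grant

    return available
```

With a guarantee of 4 and a limit of 6, a role in this sketch receives 4 units in the first stage and up to 2 more in the second, provided resources remain.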



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8068) Non-revocable bursting over quota guarantees via limits.

2018-01-17 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-8068:
--

Assignee: (was: Meng Zhu)

> Non-revocable bursting over quota guarantees via limits.
> 
>
> Key: MESOS-8068
> URL: https://issues.apache.org/jira/browse/MESOS-8068
> Project: Mesos
> Issue Type: Epic
> Components: allocation
> Reporter: Benjamin Mahler
> Priority: Major
> Labels: multitenancy
>
> Prior to introducing a revocable tier of allocation (see MESOS-4441), there 
> is a notion of whether a role can burst over its quota guarantee.
> We currently apply implicit limits in the following way:
> No quota guarantee set: (guarantee 0, no limit)
> Quota guarantee set: (guarantee G, limit G)
> That is, we only support burst-only without guarantee and 
> guarantee-only without burst. We do not support bursting over some non-zero 
> guarantee: (guarantee G, limit L >= G).
> The idea here is that we should make these implicit limits explicit to 
> clarify for users the distinction between guarantees and limits, and to 
> support bursting over the guarantee.
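
The implicit pairs listed above, and the explicit form the epic proposes, might be modelled like this. It is only a sketch: `implicit_quota` and `explicit_quota` are hypothetical helper names, and "no limit" is represented here as infinity by assumption.

```python
import math
from typing import Optional, Tuple


def implicit_quota(guarantee: Optional[float]) -> Tuple[float, float]:
    """Return the (guarantee, limit) pair implied by today's behaviour."""
    if guarantee is None:
        # No quota guarantee set: (guarantee 0, no limit), i.e. burst-only.
        return (0.0, math.inf)
    # Quota guarantee set: (guarantee G, limit G), i.e. no bursting.
    return (guarantee, guarantee)


def explicit_quota(guarantee: float, limit: float) -> Tuple[float, float]:
    """Return an explicit (guarantee G, limit L) pair with L >= G."""
    if limit < guarantee:
        raise ValueError("limit must be at least the guarantee")
    return (guarantee, limit)
```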





[jira] [Updated] (MESOS-8068) Non-revocable bursting over quota guarantees via limits.

2018-01-17 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-8068:
---
Shepherd: Benjamin Mahler



[jira] [Assigned] (MESOS-8455) Avoid unnecessary copying of protobuf in the v1 API.

2018-01-17 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-8455:
--

Assignee: Benjamin Mahler

> Avoid unnecessary copying of protobuf in the v1 API.
> 
>
> Key: MESOS-8455
> URL: https://issues.apache.org/jira/browse/MESOS-8455
> Project: Mesos
> Issue Type: Improvement
> Components: agent, master
> Reporter: Benjamin Mahler
> Assignee: Benjamin Mahler
> Priority: Major
>
> Now that we have move support for protobufs, we can avoid the unnecessary 
> copying of protobuf in the v1 API to improve the performance.





[jira] [Commented] (MESOS-8167) Introduce an executor lifecycle API.

2018-01-17 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329794#comment-16329794
 ] 

Greg Mann commented on MESOS-8167:
--

Since this is a rather forward-looking epic, I thought I'd toss a couple 
thoughts here regarding the long-term future of executors. In my mind, a better 
option for the V2 API would be to make frameworks ignorant of executors 
entirely. I'm not convinced that it's a useful abstraction at this point. I 
would be curious how many users are currently running custom executors, for 
example. The feedback I have gotten from framework devs in the past is that 
custom executors are difficult to deal with, and we have added quite a bit of 
functionality to the default executor in order to increase the breadth of use 
cases it can satisfy.

Whenever we start exploring designs for the V2 API in earnest, I think we 
should consider eliminating custom executors, and completely removing the 
executor from the scheduler API. The logical units of work from the framework's 
perspective are tasks and task groups. Unless there are real use cases which 
cannot be served without exposing executor concepts in the API, I think it would 
simplify the Mesos task lifecycle story and make it easier to develop 
frameworks.

> Introduce an executor lifecycle API.
> 
>
> Key: MESOS-8167
> URL: https://issues.apache.org/jira/browse/MESOS-8167
> Project: Mesos
> Issue Type: Epic
> Components: scheduler api
> Reporter: Benjamin Mahler
> Priority: Major
>
> Much like agents and tasks, there is a need for schedulers to track the 
> lifecycle of executors that they own (updates, reconciliation, health) as 
> well.
> Schedulers can currently shut their executors down, but without having a 
> lifecycle API to know when they are running, this isn't very useful.
> Part of this effort would include ensuring that the executors are correctly 
> kept in sync across the master and agent: MESOS-1961.





[jira] [Updated] (MESOS-7882) Mesos master rescinds all the in-flight offers from all the registered agents when a new maintenance schedule is posted for a subset of slaves

2018-01-17 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-7882:
-
Shepherd: Benjamin Mahler
Story Points: 3
  Labels: maintenance mesosphere  (was: )
  Sprint: Mesosphere Sprint 73
Target Version/s: 1.6.0

> Mesos master rescinds all the in-flight offers from all the registered agents 
> when a new maintenance schedule is posted for a subset of slaves
> --
>
> Key: MESOS-7882
> URL: https://issues.apache.org/jira/browse/MESOS-7882
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.3.0
> Environment: Ubuntu 14.04 (trusty)
> Mesos master branch.
> SHA: a31dd52ab71d2a529b55cd9111ec54acf7550ded
> Reporter: Sagar Sadashiv Patwardhan
> Assignee: Joseph Wu
> Priority: Minor
> Labels: maintenance, mesosphere
>
> We are running Mesos 1.1.0 in production. We use a custom autoscaler for 
> scaling our Mesos cluster up and down. While scaling down the cluster, the 
> autoscaler makes a POST request to the Mesos master's /maintenance/schedule 
> endpoint with a set of slaves to move to maintenance mode. This forces the 
> Mesos master to rescind all the in-flight offers from *all the slaves* in the 
> cluster. If our scheduler accepts one of these offers, then we get a 
> TASK_LOST status update back for that task. We also see log lines such as 
> these (https://gist.github.com/sagar8192/8858e7cb59a23e8e1762a27571824118) 
> in the Mesos master logs.
> After reading the code (refs: 
> https://github.com/apache/mesos/blob/master/src/master/master.cpp#L6772), it 
> appears that offers are getting rescinded for all the slaves. I am not sure 
> what the expected behavior is here, but it makes more sense if only resources 
> from slaves marked for maintenance are reclaimed.
> *Experiment:*
> To verify that it is actually happening, I checked out the master branch (sha: 
> a31dd52ab71d2a529b55cd9111ec54acf7550ded) and added some log 
> lines (https://gist.github.com/sagar8192/42ca055720549c5ff3067b1e6c7c68b3). 
> I built the binary and started a Mesos master and 2 agent processes. I used a 
> basic Python framework that launches Docker containers on these slaves. 
> I verified that there was no existing schedule for any slaves using `curl 
> 10.40.19.239:5050/maintenance/status`. I posted a maintenance schedule for one 
> of the 
> slaves (https://gist.github.com/sagar8192/fb65170240dd32a53f27e6985c549df0) 
> after starting the Mesos framework.
> *Logs:*
> mesos-master: 
> https://gist.github.com/sagar8192/91888419fdf8284e33ebd58351131203
> mesos-slave1: 
> https://gist.github.com/sagar8192/3a83364b1f5ffc63902a80c728647f31
> mesos-slave2: 
> https://gist.github.com/sagar8192/1b341ef2271dde11d276974a27109426
> Mesos framework: 
> https://gist.github.com/sagar8192/bcd4b37dba03bde0a942b5b972004e8a
> I think Mesos should rescind offers and inverse offers only for those slaves 
> that are marked for maintenance (draining mode).
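
The behaviour the reporter asks for amounts to filtering outstanding offers by the agents named in the new schedule. A minimal sketch, assuming a simplified `Offer` record and a hypothetical `offers_to_rescind` helper (neither is part of the Mesos master code):

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Offer:
    # Simplified stand-in for an outstanding offer (hypothetical).
    offer_id: str
    agent_id: str


def offers_to_rescind(outstanding: Iterable[Offer],
                      scheduled_agents: Iterable[str]) -> List[Offer]:
    """Pick only the offers from agents in the maintenance schedule."""
    scheduled = set(scheduled_agents)
    return [offer for offer in outstanding if offer.agent_id in scheduled]
```

Offers from agents outside the schedule stay valid in this sketch, so accepting them would not be expected to end in TASK_LOST.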





[jira] [Assigned] (MESOS-7882) Mesos master rescinds all the in-flight offers from all the registered agents when a new maintenance schedule is posted for a subset of slaves

2018-01-17 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-7882:


Assignee: Joseph Wu



[jira] [Commented] (MESOS-7342) Port Docker tests

2018-01-17 Thread Andrew Schwartzmeyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329733#comment-16329733
 ] 

Andrew Schwartzmeyer commented on MESOS-7342:
-

{noformat}
commit 8964137f6 (HEAD -> master, apache/master, apache/HEAD)
Author: Akash Gupta 
Date:   Wed Jan 17 13:51:59 2018 -0800

Windows: Updated networking documentation.

The networking docs now describe how the Docker network modes in the
`Network` enum work on Windows, since the enum only has Linux network
modes.

Review: https://reviews.apache.org/r/63861/

commit 6b35c93ba
Author: Akash Gupta 
Date:   Wed Jan 17 13:51:44 2018 -0800

Windows: Mapped the Docker network info types.

The Network enum in DockerInfo is specific to Linux containers. `HOST`
doesn't exist on Windows and `BRIDGE` is `NAT` on Windows. The current
default docker network setting was always `HOST`, which broke the
Windows docker executor. Now, if a specific network isn't given, the
network mode will default to `HOST` on Linux agents and `NAT` on Windows
agents. Also, `BRIDGE` mode will be translated to `NAT` on Windows.

Review: https://reviews.apache.org/r/63860/

commit eccd0a9f9
Author: Akash Gupta 
Date:   Wed Jan 17 13:51:32 2018 -0800

Windows: Fixed mock signal values in stout.

Removed `SIGSTOP` and `SIGCONT` on Windows, since they are meaningless,
and never used. Also, fixed the WEXITSTATUS macro to cast the exit
code instead of bit-masking it, since Windows exit codes are 32 bit
unsigned ints.

Review: https://reviews.apache.org/r/63859/
{noformat}
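
The network-mode defaulting described in the second commit can be paraphrased as a small mapping. This is an illustrative sketch only: `resolve_network_mode` is a hypothetical function, and the real change lives in the C++ Docker code referenced by the review.

```python
import sys
from typing import Optional


def resolve_network_mode(requested: Optional[str],
                         platform: str = sys.platform) -> str:
    """Map a requested Docker network mode to the one used on this platform."""
    on_windows = platform.startswith("win")

    if requested is None:
        # No network given: default to HOST on Linux, NAT on Windows.
        return "NAT" if on_windows else "HOST"

    if on_windows and requested == "BRIDGE":
        # BRIDGE is translated to NAT on Windows.
        return "NAT"

    # Everything else passes through unchanged in this sketch; the commit
    # message does not say what happens to an explicit HOST on Windows.
    return requested
```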

> Port Docker tests
> -
>
> Key: MESOS-7342
> URL: https://issues.apache.org/jira/browse/MESOS-7342
> Project: Mesos
> Issue Type: Bug
> Components: docker
> Environment: Windows 10
> Reporter: Andrew Schwartzmeyer
> Assignee: Akash Gupta
> Priority: Major
> Labels: microsoft, windows
>
> While one of Daniel Pravat's last acts was introducing the Docker 
> containerizer for Windows, we don't have tests. We need to port 
> `docker_tests.cpp` and `docker_containerizer_tests.cpp` to Windows.





[jira] [Assigned] (MESOS-7342) Port Docker tests

2018-01-17 Thread Andrew Schwartzmeyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Schwartzmeyer reassigned MESOS-7342:
---

Assignee: Akash Gupta  (was: Andrew Schwartzmeyer)



[jira] [Assigned] (MESOS-7342) Port Docker tests

2018-01-17 Thread Andrew Schwartzmeyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Schwartzmeyer reassigned MESOS-7342:
---

Assignee: Andrew Schwartzmeyer  (was: John Kordich)

> Port Docker tests
> -
>
> Key: MESOS-7342
> URL: https://issues.apache.org/jira/browse/MESOS-7342
> Project: Mesos
> Issue Type: Bug
> Components: docker
> Environment: Windows 10
> Reporter: Andrew Schwartzmeyer
> Assignee: Andrew Schwartzmeyer
> Priority: Major
> Labels: microsoft, windows
>
> While one of Daniel Pravat's last acts was introducing the Docker 
> containerizer for Windows, we don't have tests. We need to port 
> `docker_tests.cpp` and `docker_containerizer_tests.cpp` to Windows.





[jira] [Assigned] (MESOS-8378) ExamplesTest.PythonFramework stucks.

2018-01-17 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff reassigned MESOS-8378:
-

Assignee: Till Toenshoff

> ExamplesTest.PythonFramework stucks.
> 
>
> Key: MESOS-8378
> URL: https://issues.apache.org/jira/browse/MESOS-8378
> Project: Mesos
> Issue Type: Bug
> Components: test
> Environment: MacOS with SSL
> Reporter: Alexander Rukletsov
> Assignee: Till Toenshoff
> Priority: Major
> Labels: flaky-test
> Attachments: ExamplesTest.PythonFramework-badrun1.txt, 
> ExamplesTest.PythonFramework-badrun2.txt
>
>
> Observed this failure twice today on a macOS box. Full logs attached. These 
> lines look suspicious to me:
> {noformat}
> 10:22:22 W0103 02:22:22.359180 3747840 sched.cpp:526] Authentication timed out
> 10:22:22 I0103 02:22:22.359292 1064960 sched.cpp:466] Failed to authenticate 
> with master master@10.0.49.4:62351: Authentication discarded
> 10:22:22 E0103 02:22:22.559609 528384 process.cpp:2922] libprocess: 
> slave(2)@10.0.49.4:62351 terminating due to unordered_map::at: key not found
> 10:22:22 E0103 02:22:22.947485 1064960 process.cpp:2922] libprocess: 
> slave(3)@10.0.49.4:62351 terminating due to unordered_map::at: key not found
> 10:22:23 E0103 02:22:23.008870 528384 process.cpp:2922] libprocess: 
> slave(1)@10.0.49.4:62351 terminating due to unordered_map::at: key not found
> {noformat}





[jira] [Comment Edited] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota

2018-01-17 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329711#comment-16329711
 ] 

James Peach edited comment on MESOS-6575 at 1/17/18 11:53 PM:
--

Yeh, I think that using the soft limit is a pretty good idea. We can set the 
soft limit to the resources and the hard limit to resource + a fudge factor. We 
can kill applications based on either directly observing soft limit breaches, 
or the quota warnings (need to check whether XFS will reset them if the task 
goes back under the soft limit).

We should think about how to make this behaviour configurable per-task, since I 
still want to support the non-destructive case for storage tasks that can 
manage their space.
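
The soft/hard limit idea in this comment could look roughly like the sketch below. All names (`QuotaLimits`, `limits_for`, `should_kill`) and the `destructive` per-task switch are assumptions made for illustration; the comment does not specify how the fudge factor is chosen or how the per-task setting would be exposed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaLimits:
    soft: int  # bytes; equals the task's disk resources
    hard: int  # bytes; soft limit plus a fudge factor


def limits_for(disk_bytes: int, fudge_bytes: int) -> QuotaLimits:
    """Soft limit at the resources, hard limit at resources + fudge factor."""
    return QuotaLimits(soft=disk_bytes, hard=disk_bytes + fudge_bytes)


def should_kill(used_bytes: int, limits: QuotaLimits,
                destructive: bool = True) -> bool:
    """Kill on a soft-limit breach unless the task opted out."""
    # destructive=False models the non-destructive case for storage tasks
    # that manage their own space.
    return destructive and used_bytes > limits.soft
```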


was (Author: jamespeach):
Yeh, I think that using the soft limit is a pretty good idea. We can set the 
soft limit to the resources and the hard limit to resource + a fudge factor. We 
can kill applications based on either directly observing soft limit breaches, 
or the quota warnings (need to check whether XFS will reset them if the task 
goes back under the soft limit).

> Change `disk/xfs` isolator to terminate executor when it exceeds quota
> --
>
> Key: MESOS-6575
> URL: https://issues.apache.org/jira/browse/MESOS-6575
> Project: Mesos
> Issue Type: Task
> Components: agent, containerization
> Reporter: Santhosh Kumar Shanmugham
> Assignee: James Peach
> Priority: Major
>
> Unlike the {{disk/du}} isolator, which sends a {{ContainerLimitation}} protobuf 
> when the executor exceeds the quota, the {{disk/xfs}} isolator, which relies on 
> XFS's internal quota enforcement, silently fails the {{write}} operation 
> that would cause the quota limit to be exceeded, without surfacing the quota 
> breach information.
> This task is to change the `disk/xfs` isolator so that a 
> {{ContainerLimitation}} message is triggered when the quota is exceeded. 
> This feature will rely on the underlying filesystem being mounted with 
> {{pqnoenforce}} (accounting-only mode), so that XFS does not silently cause 
> an {{EDQUOT}} error on writes that cause the quota to be exceeded. The 
> isolator can then track the disk quota via {{xfs_quota}}, very much like 
> {{disk/du}} using {{du}}, every {{container_disk_watch_interval}} and surface 
> the disk quota limit exceeded event via a {{ContainerLimitation}} protobuf, 
> causing the executor to be terminated. This feature can then be turned on/off 
> via the existing {{enforce_container_disk_quota}} option.
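
One pass of the periodic check described above might be sketched as follows. The `Limitation` record stands in for the `ContainerLimitation` protobuf, and `check_quotas` and its `usage_of` callback are hypothetical; in the real isolator the usage figures would come from XFS quota accounting rather than a Python callable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Limitation:
    # Stand-in for a ContainerLimitation message (hypothetical shape).
    container_id: str
    message: str


def check_quotas(quotas: Dict[str, int],
                 usage_of: Callable[[str], int],
                 enforce: bool = True) -> List[Limitation]:
    """Run one watch-interval pass and report containers over quota."""
    if not enforce:
        # Mirrors turning enforcement off: only account, never report.
        return []

    breaches = []
    for container_id, quota in quotas.items():
        used = usage_of(container_id)
        if used > quota:
            breaches.append(Limitation(
                container_id,
                f"Disk usage {used} exceeds quota {quota}"))
    return breaches
```

A caller would run this once per watch interval and terminate the executor of each container that appears in the result.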





[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota

2018-01-17 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329711#comment-16329711
 ] 

James Peach commented on MESOS-6575:


Yeh, I think that using the soft limit is a pretty good idea. We can set the 
soft limit to the resources and the hard limit to resource + a fudge factor. We 
can kill applications based on either directly observing soft limit breaches, 
or the quota warnings (need to check whether XFS will reset them if the task 
goes back under the soft limit).



[jira] [Assigned] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota

2018-01-17 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-6575:
--

Assignee: James Peach



[jira] [Assigned] (MESOS-2921) Add move constructors / assignment to Result.

2018-01-17 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-2921:
--

Resolution: Fixed
  Assignee: Benjamin Mahler

{noformat}
commit 7d8ae37b022ac82c6945ff07ac980017f347b45e (HEAD -> master, apache/master, 
apache/HEAD, bmahler_result_move)
Author: Benjamin Mahler 
Date:   Wed Jan 17 15:28:24 2018 -0800

Added a default move constructor for Result.

Review: https://reviews.apache.org/r/65200
{noformat}

> Add move constructors / assignment to Result.
> -
>
> Key: MESOS-2921
> URL: https://issues.apache.org/jira/browse/MESOS-2921
> Project: Mesos
> Issue Type: Improvement
> Components: stout
> Reporter: Benjamin Mahler
> Assignee: Benjamin Mahler
> Priority: Major
>
> Now that we have C++11, let's add move constructors and move assignment 
> operators for Result, similarly to what was done for Option.





[jira] [Commented] (MESOS-7016) Make default AWAIT_* duration configurable

2018-01-17 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329588#comment-16329588
 ] 

James Peach commented on MESOS-7016:


I have most of a patch that adds a global variable for the default timeout to 
{{libprocess}} and a Mesos test suite flag to configure it.

> Make default AWAIT_* duration configurable
> --
>
> Key: MESOS-7016
> URL: https://issues.apache.org/jira/browse/MESOS-7016
> Project: Mesos
> Issue Type: Improvement
> Components: libprocess, test
> Reporter: Benjamin Bannier
> Assignee: James Peach
> Priority: Major
>
> libprocess defines a number of helpers {{AWAIT_*}} to wait for a 
> {{process::Future}} reaching terminal states. These helpers are used in tests.
> Currently the default duration to wait before triggering an assertion failure 
> is 15s. This value was chosen as a compromise between failing fast on likely 
> fast developer machines, but also allowing enough time for tests to pass in 
> high-contention environments (e.g., overbooked CI machines).
> If a machine is more overloaded than expected, {{Futures}} might take longer 
> to reach the desired state, and tests could fail. Ultimately we should 
> consider running tests with a paused clock to eliminate this source of test 
> flakiness, see MESOS-4101, but as an intermediate measure we should make the 
> default timeout duration configurable.
> A simple approach might be to expose a build variable allowing users to set 
> at configure/cmake time a desired timeout duration for the setup they are 
> building for. This would allow us to define longer timeouts in the CI build 
> scripts, while keeping default timeouts as short as possible.
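
The general shape of a configurable default timeout can be sketched as below. This is not the libprocess patch: the environment variable `TEST_AWAIT_TIMEOUT_SECS` and both function names are invented for illustration, whereas the issue proposes a build variable and the comment mentions a global variable plus a test suite flag. Only the 15-second default comes from the issue text.

```python
import os
import threading
from typing import Mapping, Optional

# Default wait before a test assertion fails, as stated in the issue.
DEFAULT_AWAIT_SECONDS = 15.0


def default_await_timeout(env: Mapping[str, str] = os.environ) -> float:
    """Return the default timeout, overridable through the environment."""
    # TEST_AWAIT_TIMEOUT_SECS is a hypothetical variable name.
    raw = env.get("TEST_AWAIT_TIMEOUT_SECS")
    return float(raw) if raw else DEFAULT_AWAIT_SECONDS


def await_ready(event: threading.Event,
                timeout: Optional[float] = None) -> bool:
    """Wait for the event, using the configurable default when none is given."""
    effective = default_await_timeout() if timeout is None else timeout
    return event.wait(effective)
```

A CI script could then raise the default for overloaded machines while developer machines keep the short value.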





[jira] [Assigned] (MESOS-7016) Make default AWAIT_* duration configurable

2018-01-17 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-7016:
--

Assignee: James Peach



[jira] [Assigned] (MESOS-5817) Port libprocess process_tests.cpp

2018-01-17 Thread Andrew Schwartzmeyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Schwartzmeyer reassigned MESOS-5817:
---

Assignee: Eric Mumau  (was: Andrew Schwartzmeyer)

> Port libprocess process_tests.cpp
> -
>
> Key: MESOS-5817
> URL: https://issues.apache.org/jira/browse/MESOS-5817
> Project: Mesos
> Issue Type: Bug
> Components: libprocess
> Reporter: Alex Clemmer
> Assignee: Eric Mumau
> Priority: Major
> Labels: libprocess, mesosphere
>






[jira] [Assigned] (MESOS-5819) Port libprocess sequence_tests.cpp

2018-01-17 Thread Andrew Schwartzmeyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Schwartzmeyer reassigned MESOS-5819:
---

Assignee: Eric Mumau  (was: Andrew Schwartzmeyer)

> Port libprocess sequence_tests.cpp
> --
>
> Key: MESOS-5819
> URL: https://issues.apache.org/jira/browse/MESOS-5819
> Project: Mesos
> Issue Type: Bug
> Components: libprocess
> Reporter: Alex Clemmer
> Assignee: Eric Mumau
> Priority: Major
> Labels: libprocess, mesosphere
>






[jira] [Assigned] (MESOS-5818) Port libprocess reap_tests.cpp

2018-01-17 Thread Andrew Schwartzmeyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Schwartzmeyer reassigned MESOS-5818:
---

Assignee: Eric Mumau  (was: Andrew Schwartzmeyer)

> Port libprocess reap_tests.cpp
> --
>
> Key: MESOS-5818
> URL: https://issues.apache.org/jira/browse/MESOS-5818
> Project: Mesos
> Issue Type: Bug
> Components: libprocess
> Reporter: Alex Clemmer
> Assignee: Eric Mumau
> Priority: Major
> Labels: libprocess, mesosphere
>






[jira] [Assigned] (MESOS-5815) Port libprocess io_tests.cpp

2018-01-17 Thread Andrew Schwartzmeyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Schwartzmeyer reassigned MESOS-5815:
---

Assignee: Eric Mumau  (was: Andrew Schwartzmeyer)

> Port libprocess io_tests.cpp
> 
>
> Key: MESOS-5815
> URL: https://issues.apache.org/jira/browse/MESOS-5815
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Alex Clemmer
>Assignee: Eric Mumau
>Priority: Major
>  Labels: libprocess, mesosphere
>






[jira] [Commented] (MESOS-7988) Mesos attempts to open handle for the system idle process

2018-01-17 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329516#comment-16329516
 ] 

Gilbert Song commented on MESOS-7988:
-

I updated the fix version to 1.5.0

/cc [~jpe...@apache.org] [~andschwa]

> Mesos attempts to open handle for the system idle process
> -
>
> Key: MESOS-7988
> URL: https://issues.apache.org/jira/browse/MESOS-7988
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
> Environment: Windows 10
>Reporter: Andrew Schwartzmeyer
>Assignee: Andrew Schwartzmeyer
>Priority: Major
>  Labels: windows
> Fix For: 1.5.0
>
>
> While running {{mesos-tests}} under Application Verifier, I found that we 
> were inadvertently attempting to get a handle for the System Idle Process. 
> This is not permitted by the OS, and so the {{OpenProcess}} system call was 
> failing. I further found that we were incorrectly checking the failure 
> condition of {{OpenProcess}}. We were attempting to open this handle when 
> opening handles for all PIDs returned by {{os::pids}}, and the Windows API 
> {{EnumProcesses}} includes PID 0 (System Idle Process). As this PID is not 
> useful, we can safely remove it from the {{os::pids}} API. Attempting to do 
> _anything_ with PID 0 will likely result in failure, as it is a special 
> process on Windows, and so we can help to prevent these errors by filtering 
> out PID 0.
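
The filtering described above can be sketched as follows. This is a minimal illustration, not the actual stout change: `filterIdleProcess` is a hypothetical helper name, and the raw PID list stands in for whatever the Windows process enumeration returns. For the failure check mentioned in the description, note that the Windows documentation states {{OpenProcess}} returns NULL on failure, so that is the value to test for.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical helper (not the real stout code): given the raw PIDs
// reported by the OS, return the set that is safe to open handles for.
// PID 0 is the System Idle Process on Windows, so it is dropped.
std::set<std::uint32_t> filterIdleProcess(
    const std::vector<std::uint32_t>& rawPids)
{
  std::set<std::uint32_t> result;

  for (std::uint32_t pid : rawPids) {
    if (pid == 0) {
      continue; // Skip the System Idle Process.
    }
    result.insert(pid);
  }

  return result;
}
```

Filtering at the source means callers that iterate over the returned PIDs never see PID 0 and do not each need their own special case.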





[jira] [Updated] (MESOS-7988) Mesos attempts to open handle for the system idle process

2018-01-17 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-7988:

Fix Version/s: 1.5.0

> Mesos attempts to open handle for the system idle process
> -
>
> Key: MESOS-7988
> URL: https://issues.apache.org/jira/browse/MESOS-7988
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
> Environment: Windows 10
>Reporter: Andrew Schwartzmeyer
>Assignee: Andrew Schwartzmeyer
>Priority: Major
>  Labels: windows
> Fix For: 1.5.0
>
>
> While running {{mesos-tests}} under Application Verifier, I found that we 
> were inadvertently attempting to get a handle for the System Idle Process. 
> This is not permitted by the OS, and so the {{OpenProcess}} system call was 
> failing. I further found that we were incorrectly checking the failure 
> condition of {{OpenProcess}}. We were attempting to open this handle when 
> opening handles for all PIDs returned by {{os::pids}}, and the Windows API 
> {{EnumProcesses}} includes PID 0 (System Idle Process). As this PID is not 
> useful, we can safely remove it from the {{os::pids}} API. Attempting to do 
> _anything_ with PID 0 will likely result in failure, as it is a special 
> process on Windows, and so we can help to prevent these errors by filtering 
> out PID 0.





[jira] [Updated] (MESOS-6705) Port `fetcher_tests.cpp`

2018-01-17 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-6705:
-
Fix Version/s: 1.5.0

> Port `fetcher_tests.cpp`
> 
>
> Key: MESOS-6705
> URL: https://issues.apache.org/jira/browse/MESOS-6705
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Alex Clemmer
>Assignee: Jeff Coffler
>Priority: Major
>  Labels: microsoft, windows-mvp
> Fix For: 1.5.0
>
>






[jira] [Updated] (MESOS-6709) Enable HTTP and TCP health checks on Windows

2018-01-17 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-6709:
-
Fix Version/s: 1.5.0

> Enable HTTP and TCP health checks on Windows
> 
>
> Key: MESOS-6709
> URL: https://issues.apache.org/jira/browse/MESOS-6709
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Alex Clemmer
>Assignee: John Kordich
>Priority: Blocker
>  Labels: microsoft, windows, windows-mvp
> Fix For: 1.5.0
>
>
> This specifically excludes command health checks, as these are blocked on the 
> IO Switchboard code.





[jira] [Updated] (MESOS-8144) Add a mock resource provider manager.

2018-01-17 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-8144:

Target Version/s:   (was: 1.5.0)

> Add a mock resource provider manager.
> -
>
> Key: MESOS-8144
> URL: https://issues.apache.org/jira/browse/MESOS-8144
> Project: Mesos
>  Issue Type: Task
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
>  Labels: storage
>
> To test a storage local resource provider, we need to inject a mock resource 
> provider manager such that:
> 1. A full agent will start during the test so the resource provider can 
> launch standalone containers for CSI plugins.
> 2. We can inject offer operations through the mock manager to test the 
> resource provider.





[jira] [Updated] (MESOS-8360) Add operator API 'PRUNE_IMAGES' for manual container image GC.

2018-01-17 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-8360:

Fix Version/s: 1.5.0

> Add operator API 'PRUNE_IMAGES' for manual container image GC.
> --
>
> Key: MESOS-8360
> URL: https://issues.apache.org/jira/browse/MESOS-8360
> Project: Mesos
>  Issue Type: Improvement
>  Components: image-gc
>Reporter: Gilbert Song
>Assignee: Zhitao Li
>Priority: Major
>  Labels: image-gc, mesosphere, uber
> Fix For: 1.5.0
>
>






[jira] [Created] (MESOS-8456) Allocator should allow roles to burst above guarantees but below limits.

2018-01-17 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-8456:
---

 Summary: Allocator should allow roles to burst above guarantees 
but below limits.
 Key: MESOS-8456
 URL: https://issues.apache.org/jira/browse/MESOS-8456
 Project: Mesos
  Issue Type: Improvement
  Components: allocation
Reporter: Meng Zhu
Assignee: Meng Zhu


Currently, the allocator only allocates resources for quota roles up to their 
guarantee in the first allocation stage. The allocator should continue 
allocating resources to these roles in the second stage, as long as they stay 
below their quota limit. In other words, the allocator should allow roles to 
burst above their guarantee but below the limit.
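
The two stages can be sketched roughly as below. This is a simplified model under stated assumptions, not the Mesos allocator: `RoleQuota`, `allocateUpToGuarantee`, and `allocateUpToLimit` are hypothetical names, and a single scalar stands in for real resource vectors.

```cpp
#include <algorithm>

// Hypothetical, simplified model of one role's quota using a single
// scalar resource (e.g. CPUs). Real allocator state is far richer.
struct RoleQuota
{
  double guarantee; // Amount the role is guaranteed.
  double limit;     // Upper bound the role may burst to (>= guarantee).
  double allocated; // Amount currently allocated to the role.
};

// First stage: allocate only up to the guarantee.
double allocateUpToGuarantee(RoleQuota& role, double available)
{
  double unmet = std::max(0.0, role.guarantee - role.allocated);
  double amount = std::min(available, unmet);
  role.allocated += amount;
  return amount;
}

// Second stage: allow bursting above the guarantee, never above the limit.
double allocateUpToLimit(RoleQuota& role, double available)
{
  double headroom = std::max(0.0, role.limit - role.allocated);
  double amount = std::min(available, headroom);
  role.allocated += amount;
  return amount;
}
```

With a guarantee of 4 and a limit of 6, the first stage hands out at most 4 units and the second stage at most 2 more, regardless of how much is still available.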





[jira] [Assigned] (MESOS-5814) Port libprocess http_tests.cpp

2018-01-17 Thread Andrew Schwartzmeyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Schwartzmeyer reassigned MESOS-5814:
---

Assignee: Eric Mumau  (was: Andrew Schwartzmeyer)

> Port libprocess http_tests.cpp
> --
>
> Key: MESOS-5814
> URL: https://issues.apache.org/jira/browse/MESOS-5814
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Alex Clemmer
>Assignee: Eric Mumau
>Priority: Major
>  Labels: libprocess, mesosphere
>






[jira] [Updated] (MESOS-8078) Some fields went missing with no replacement in api/v1.

2018-01-17 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-8078:

Summary: Some fields went missing with no replacement in api/v1.  (was: 
Some fields went missing with no replacement in api/v1)

> Some fields went missing with no replacement in api/v1.
> ---
>
> Key: MESOS-8078
> URL: https://issues.apache.org/jira/browse/MESOS-8078
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Dmitrii Rozhkov
>Assignee: Greg Mann
>Priority: Critical
>  Labels: mesosphere
> Fix For: 1.5.0
>
>
> Hi friends, 
> These fields are available via the state.json but went missing in the v1 of 
> the API:
> -leader_info- -> available via GET_MASTER which should always return leading 
> master info
> start_time
> elected_time
> As we're showing them on the Overview page of the DC/OS UI, but would prefer 
> not to use state.json, it would be great to have them somewhere in V1.





[jira] [Updated] (MESOS-8078) Some fields went missing with no replacement in api/v1

2018-01-17 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-8078:

Issue Type: Bug  (was: Story)

> Some fields went missing with no replacement in api/v1
> --
>
> Key: MESOS-8078
> URL: https://issues.apache.org/jira/browse/MESOS-8078
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Dmitrii Rozhkov
>Assignee: Greg Mann
>Priority: Critical
>  Labels: mesosphere
> Fix For: 1.5.0
>
>
> Hi friends, 
> These fields are available via the state.json but went missing in the v1 of 
> the API:
> -leader_info- -> available via GET_MASTER which should always return leading 
> master info
> start_time
> elected_time
> As we're showing them on the Overview page of the DC/OS UI, but would prefer 
> not to use state.json, it would be great to have them somewhere in V1.





[jira] [Updated] (MESOS-8078) Some fields went missing with no replacement in api/v1

2018-01-17 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-8078:

Issue Type: Improvement  (was: Bug)

> Some fields went missing with no replacement in api/v1
> --
>
> Key: MESOS-8078
> URL: https://issues.apache.org/jira/browse/MESOS-8078
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Dmitrii Rozhkov
>Assignee: Greg Mann
>Priority: Critical
>  Labels: mesosphere
> Fix For: 1.5.0
>
>
> Hi friends, 
> These fields are available via the state.json but went missing in the v1 of 
> the API:
> -leader_info- -> available via GET_MASTER which should always return leading 
> master info
> start_time
> elected_time
> As we're showing them on the Overview page of the DC/OS UI, but would prefer 
> not to use state.json, it would be great to have them somewhere in V1.





[jira] [Updated] (MESOS-7704) Remove use of #pragma comment (lib, "IPHLPAPI.lib").

2018-01-17 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-7704:

Summary: Remove use of #pragma comment (lib, "IPHLPAPI.lib").  (was: Remove 
use of #pragma comment(lib...)

> Remove use of #pragma comment (lib, "IPHLPAPI.lib").
> 
>
> Key: MESOS-7704
> URL: https://issues.apache.org/jira/browse/MESOS-7704
> Project: Mesos
>  Issue Type: Bug
> Environment: Windows
>Reporter: Andrew Schwartzmeyer
>Assignee: Andrew Schwartzmeyer
>Priority: Major
> Fix For: 1.5.0
>
>
> Commits 0a1e94d84, 5a7c8d8ef, and f7f661525 introduced the pattern of linking 
> to a Windows system library with the {{#pragma comment(lib, 
> "IPHLPAPI.lib")}}, instead of adding the library to the correct place in the 
> CMake build system.
> While this works, we should be consistent in the way we link to libraries, so 
> this code should be removed and {{IPHLPAPI.lib}} should be added to stout's 
> {{LFLAGS}}.





[jira] [Assigned] (MESOS-8068) Non-revocable bursting over quota guarantees via limits.

2018-01-17 Thread Meng Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-8068:
---

Assignee: Meng Zhu

> Non-revocable bursting over quota guarantees via limits.
> 
>
> Key: MESOS-8068
> URL: https://issues.apache.org/jira/browse/MESOS-8068
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Meng Zhu
>Priority: Major
>  Labels: multitenancy
>
> Prior to introducing a revocable tier of allocation (see MESOS-4441), there 
> is a notion of whether a role can burst over its quota guarantee.
> We currently apply implicit limits in the following way:
> No quota guarantee set: (guarantee 0, no limit)
> Quota guarantee set: (guarantee G, limit G)
> That is, we only support burst-only without guarantee and 
> guarantee-only without burst. We do not support bursting over some non-zero 
> guarantee: (guarantee G, limit L >= G).
> The idea here is that we should make these implicit limits explicit to 
> clarify for users the distinction between guarantees and limits, and to 
> support bursting over the guarantee.
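
The implicit mapping and the proposed explicit form can be written down as a small sketch. The names here (`Quota`, `NO_LIMIT`, `implicitQuota`, `isValid`) are hypothetical and a single scalar stands in for real resources; this is only meant to make the (guarantee, limit) pairs concrete.

```cpp
#include <limits>

// Hypothetical scalar model of a quota as a (guarantee, limit) pair.
struct Quota
{
  double guarantee;
  double limit; // Infinity is used here to mean "no limit".
};

const double NO_LIMIT = std::numeric_limits<double>::infinity();

// The implicit behavior described in the issue:
//   no guarantee set -> (guarantee 0, no limit)
//   guarantee G set  -> (guarantee G, limit G)
Quota implicitQuota(bool guaranteeSet, double g)
{
  if (!guaranteeSet) {
    return Quota{0.0, NO_LIMIT};
  }
  return Quota{g, g};
}

// The explicit form the issue asks for: guarantee G, limit L >= G.
bool isValid(const Quota& quota)
{
  return quota.guarantee >= 0.0 && quota.limit >= quota.guarantee;
}
```

Under the implicit rules a pair such as (8, 12) cannot be expressed at all, which is the gap the explicit limit is meant to close.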





[jira] [Commented] (MESOS-8344) Improve JSON v1 operator API performance.

2018-01-17 Thread Meng Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329272#comment-16329272
 ] 

Meng Zhu commented on MESOS-8344:
-

Evaluation results of operator API performance across different versions can be 
found here:

https://drive.google.com/open?id=1mN4OLEi7UdbLxv-Q3k3iMlmO6Gb2wx9X-gHRnDENSK0

> Improve JSON v1 operator API performance.
> -
>
> Key: MESOS-8344
> URL: https://issues.apache.org/jira/browse/MESOS-8344
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Benjamin Mahler
>Assignee: Meng Zhu
>Priority: Major
>  Labels: performance
> Fix For: 1.5.0
>
>
> According to some user reports, a simple comparison of the v1 operator API 
> (using the "GET_TASKS" call) and the v0 /tasks HTTP endpoint shows that the 
> v1 API suffers from an inefficient implementation:
> {noformat: title=Curl Timing}
> Operator HTTP API (GET_TASKS): 0.02s user 0.08s system 1% cpu 9.883 total
> Old /tasks API: /tasks: 0.00s user 0.00s system 1% cpu 0.222 total
> {noformat}
> Looking over the implementation, it suffers from the same issues we 
> originally had with the JSON endpoints:
> * Excessive copying up the "tree" of state building calls.
> * Building up the state object as opposed to directly serializing it.





[jira] [Updated] (MESOS-5726) Benchmark the v1 Operator API

2018-01-17 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-5726:
---
Fix Version/s: 1.6.0

> Benchmark the v1 Operator API
> -
>
> Key: MESOS-5726
> URL: https://issues.apache.org/jira/browse/MESOS-5726
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: Meng Zhu
>Priority: Major
>  Labels: mesosphere
> Fix For: 1.6.0
>
>
> Just like what we did with the v1 framework API, we need to benchmark the 
> performance of the v1 operator API.
> As part of this benchmarking, we should evaluate whether evolving 
> un-versioned protos to versioned protos in some of the API handlers (e.g., 
> getFrameworks) is expensive.





[jira] [Updated] (MESOS-8344) Improve JSON v1 operator API performance.

2018-01-17 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-8344:
---
Summary: Improve JSON v1 operator API performance.  (was: Improve JSON v1 
operator API performance with jsonify.)

> Improve JSON v1 operator API performance.
> -
>
> Key: MESOS-8344
> URL: https://issues.apache.org/jira/browse/MESOS-8344
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Benjamin Mahler
>Assignee: Meng Zhu
>Priority: Major
>  Labels: performance
>
> According to some user reports, a simple comparison of the v1 operator API 
> (using the "GET_TASKS" call) and the v0 /tasks HTTP endpoint shows that the 
> v1 API suffers from an inefficient implementation:
> {noformat: title=Curl Timing}
> Operator HTTP API (GET_TASKS): 0.02s user 0.08s system 1% cpu 9.883 total
> Old /tasks API: /tasks: 0.00s user 0.00s system 1% cpu 0.222 total
> {noformat}
> Looking over the implementation, it suffers from the same issues we 
> originally had with the JSON endpoints:
> * Excessive copying up the "tree" of state building calls.
> * Building up the state object as opposed to directly serializing it.





[jira] [Assigned] (MESOS-8344) Improve JSON v1 operator API performance with jsonify.

2018-01-17 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-8344:
--

Assignee: Meng Zhu

> Improve JSON v1 operator API performance with jsonify.
> --
>
> Key: MESOS-8344
> URL: https://issues.apache.org/jira/browse/MESOS-8344
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Benjamin Mahler
>Assignee: Meng Zhu
>Priority: Major
>  Labels: performance
>
> According to some user reports, a simple comparison of the v1 operator API 
> (using the "GET_TASKS" call) and the v0 /tasks HTTP endpoint shows that the 
> v1 API suffers from an inefficient implementation:
> {noformat: title=Curl Timing}
> Operator HTTP API (GET_TASKS): 0.02s user 0.08s system 1% cpu 9.883 total
> Old /tasks API: /tasks: 0.00s user 0.00s system 1% cpu 0.222 total
> {noformat}
> Looking over the implementation, it suffers from the same issues we 
> originally had with the JSON endpoints:
> * Excessive copying up the "tree" of state building calls.
> * Building up the state object as opposed to directly serializing it.





[jira] [Comment Edited] (MESOS-8455) Avoid unnecessary copying of protobuf in the v1 API.

2018-01-17 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329227#comment-16329227
 ] 

Benjamin Mahler edited comment on MESOS-8455 at 1/17/18 7:05 PM:
-

Master side of the copy elimination is in 1.5.0:

{noformat}
commit 3035c8828345e7b0f33a4755d50139c8b693d567 (bmahler_http_api_optimizations)
Author: Benjamin Mahler 
Date:   Tue Dec 19 17:43:14 2017 -0800

Eliminated some unnecessary copying in the HTTP operator API.

This is only a minor portion of the performance improvements
needed to bring the v1 operator API close to the v0 API
performance.

Review: https://reviews.apache.org/r/64741
{noformat}


was (Author: bmahler):
Master side of the copy elimination is in 1.5:

{noformat}
commit 3035c8828345e7b0f33a4755d50139c8b693d567 (bmahler_http_api_optimizations)
Author: Benjamin Mahler 
Date:   Tue Dec 19 17:43:14 2017 -0800

Eliminated some unnecessary copying in the HTTP operator API.

This is only a minor portion of the performance improvements
needed to bring the v1 operator API close to the v0 API
performance.

Review: https://reviews.apache.org/r/64741
{noformat}

> Avoid unnecessary copying of protobuf in the v1 API.
> 
>
> Key: MESOS-8455
> URL: https://issues.apache.org/jira/browse/MESOS-8455
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, master
>Reporter: Benjamin Mahler
>Priority: Major
>
> Now that we have move support for protobufs, we can avoid the unnecessary 
> copying of protobuf in the v1 API to improve the performance.





[jira] [Commented] (MESOS-8455) Avoid unnecessary copying of protobuf in the v1 API.

2018-01-17 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329227#comment-16329227
 ] 

Benjamin Mahler commented on MESOS-8455:


Master side of the copy elimination is in 1.5:

{noformat}
commit 3035c8828345e7b0f33a4755d50139c8b693d567 (bmahler_http_api_optimizations)
Author: Benjamin Mahler 
Date:   Tue Dec 19 17:43:14 2017 -0800

Eliminated some unnecessary copying in the HTTP operator API.

This is only a minor portion of the performance improvements
needed to bring the v1 operator API close to the v0 API
performance.

Review: https://reviews.apache.org/r/64741
{noformat}

> Avoid unnecessary copying of protobuf in the v1 API.
> 
>
> Key: MESOS-8455
> URL: https://issues.apache.org/jira/browse/MESOS-8455
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, master
>Reporter: Benjamin Mahler
>Priority: Major
>
> Now that we have move support for protobufs, we can avoid the unnecessary 
> copying of protobuf in the v1 API to improve the performance.





[jira] [Created] (MESOS-8455) Avoid unnecessary copying of protobuf in the v1 API.

2018-01-17 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-8455:
--

 Summary: Avoid unnecessary copying of protobuf in the v1 API.
 Key: MESOS-8455
 URL: https://issues.apache.org/jira/browse/MESOS-8455
 Project: Mesos
  Issue Type: Improvement
  Components: agent, master
Reporter: Benjamin Mahler


Now that we have move support for protobufs, we can avoid the unnecessary 
copying of protobuf in the v1 API to improve the performance.
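
As a rough illustration of the difference, the sketch below uses a plain struct as a stand-in for a large message; it assumes that a protobuf message with move support behaves similarly when moved. `Response`, `wrapByCopy`, and `wrapByMove` are hypothetical names, not Mesos or protobuf APIs.

```cpp
#include <string>
#include <utility>
#include <vector>

// Stand-in for a large message. A protobuf message with move support
// is assumed to behave similarly when moved.
struct Response
{
  std::vector<std::string> tasks;
};

// Copies every element of the argument into the result.
Response wrapByCopy(const Response& inner)
{
  return inner;
}

// Takes over the argument's storage instead of copying it.
Response wrapByMove(Response&& inner)
{
  return std::move(inner);
}
```

A caller that no longer needs its local value can pass it with `std::move`, so the element storage is handed over rather than duplicated at each level of the call tree.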





[jira] [Commented] (MESOS-5726) Benchmark the v1 Operator API

2018-01-17 Thread Meng Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329200#comment-16329200
 ] 

Meng Zhu commented on MESOS-5726:
-

Evaluation results are available at:

https://drive.google.com/open?id=1mN4OLEi7UdbLxv-Q3k3iMlmO6Gb2wx9X-gHRnDENSK0

> Benchmark the v1 Operator API
> -
>
> Key: MESOS-5726
> URL: https://issues.apache.org/jira/browse/MESOS-5726
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: Meng Zhu
>Priority: Major
>  Labels: mesosphere
>
> Just like what we did with the v1 framework API, we need to benchmark the 
> performance of the v1 operator API.
> As part of this benchmarking, we should evaluate whether evolving 
> un-versioned protos to versioned protos in some of the API handlers (e.g., 
> getFrameworks) is expensive.





[jira] [Updated] (MESOS-8344) Improve JSON v1 operator API performance.

2018-01-17 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-8344:
---
Summary: Improve JSON v1 operator API performance.  (was: Improve v1 
operator API performance.)

> Improve JSON v1 operator API performance.
> -
>
> Key: MESOS-8344
> URL: https://issues.apache.org/jira/browse/MESOS-8344
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Benjamin Mahler
>Priority: Major
>  Labels: performance
>
> According to some user reports, a simple comparison of the v1 operator API 
> (using the "GET_TASKS" call) and the v0 /tasks HTTP endpoint shows that the 
> v1 API suffers from an inefficient implementation:
> {noformat: title=Curl Timing}
> Operator HTTP API (GET_TASKS): 0.02s user 0.08s system 1% cpu 9.883 total
> Old /tasks API: /tasks: 0.00s user 0.00s system 1% cpu 0.222 total
> {noformat}
> Looking over the implementation, it suffers from the same issues we 
> originally had with the JSON endpoints:
> * Excessive copying up the "tree" of state building calls.
> * Building up the state object as opposed to directly serializing it.





[jira] [Updated] (MESOS-8344) Improve v1 operator API performance.

2018-01-17 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-8344:
---
Summary: Improve v1 operator API performance.  (was: V1 Operator API 
performance is much worse than v0.)

> Improve v1 operator API performance.
> 
>
> Key: MESOS-8344
> URL: https://issues.apache.org/jira/browse/MESOS-8344
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Benjamin Mahler
>Priority: Major
>  Labels: performance
>
> According to some user reports, a simple comparison of the v1 operator API 
> (using the "GET_TASKS" call) and the v0 /tasks HTTP endpoint shows that the 
> v1 API suffers from an inefficient implementation:
> {noformat: title=Curl Timing}
> Operator HTTP API (GET_TASKS): 0.02s user 0.08s system 1% cpu 9.883 total
> Old /tasks API: /tasks: 0.00s user 0.00s system 1% cpu 0.222 total
> {noformat}
> Looking over the implementation, it suffers from the same issues we 
> originally had with the JSON endpoints:
> * Excessive copying up the "tree" of state building calls.
> * Building up the state object as opposed to directly serializing it.





[jira] [Updated] (MESOS-8411) Killing a queued task can lead to the command executor never terminating.

2018-01-17 Thread Meng Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu updated MESOS-8411:

Target Version/s: 1.6.0

> Killing a queued task can lead to the command executor never terminating.
> -
>
> Key: MESOS-8411
> URL: https://issues.apache.org/jira/browse/MESOS-8411
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Meng Zhu
>Priority: Critical
>
> If a task is killed while the executor is re-registering, we will remove it 
> from queued tasks and shut down the executor if all of its initial tasks 
> could not be delivered. However, there is a case (within {{Slave::___run}}) 
> where we leave the executor running. The race is:
> # Command-executor task launched.
> # Command executor sends registration message. Agent tells containerizer to 
> update the resources before it sends the tasks to the executor.
> # Kill arrives, and we synchronously remove the task from queued tasks.
> # Containerizer finishes updating the resources, and in {{Slave::___run}} the 
> killed task is ignored.
> # Command executor stays running!
> Executors could have a timeout to handle this case, but it's not clear that 
> all executors will implement this correctly. It would be better to have a 
> defensive policy that will shut down an executor if all of its initial batch 
> of tasks were killed prior to delivery.
> In order to implement this, one approach discussed with [~vinodkone] is to 
> look at the running + terminated but unacked + completed tasks, and if empty, 
> shut the executor down in the {{Slave::___run}} path. This will require us to 
> check that the completed task cache size is set to at least 1, and this also 
> assumes that the completed tasks are not cleared based on time or during 
> agent recovery.
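
The defensive check discussed above could look roughly like this. It is a sketch only: `ExecutorTasks` and `shouldShutdown` are hypothetical names rather than agent code, and the extra requirement that no tasks remain queued is an assumption added here so that an executor with pending deliveries is left alone.

```cpp
#include <cstddef>

// Hypothetical summary of the agent's bookkeeping for one executor.
struct ExecutorTasks
{
  std::size_t queued;            // Not yet delivered to the executor.
  std::size_t running;           // Delivered and still running.
  std::size_t terminatedUnacked; // Terminated, status update not yet acked.
  std::size_t completed;         // Held in the completed task cache.
};

// Returns true if the executor never received a task and has nothing
// left to receive, i.e. its whole initial batch was killed before
// delivery. Checking `queued` is an assumption of this sketch.
bool shouldShutdown(const ExecutorTasks& tasks)
{
  return tasks.queued == 0 &&
         tasks.running == 0 &&
         tasks.terminatedUnacked == 0 &&
         tasks.completed == 0;
}
```

The `completed` term is why the completed task cache must hold at least one entry: with a cache size of zero, an executor whose tasks all finished normally would be indistinguishable from one that never received any.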





[jira] [Updated] (MESOS-8454) Add a download link for master and agent logs in WebUI

2018-01-17 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-8454:
--
Sprint: Mesosphere Sprint 73

> Add a download link for master and agent logs in WebUI
> --
>
> Key: MESOS-8454
> URL: https://issues.apache.org/jira/browse/MESOS-8454
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Reporter: Vinod Kone
>Assignee: Armand Grillet
>Priority: Major
>
> Just like task sandboxes, it would be great for us to provide a download link 
> for master and agent logs in the WebUI. Right now the log link opens up 
> the pailer, which is not really convenient for running `grep` and such while 
> debugging.





[jira] [Updated] (MESOS-8454) Add a download link for master and agent logs in WebUI

2018-01-17 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-8454:
--
Shepherd: Vinod Kone
Story Points: 3

> Add a download link for master and agent logs in WebUI
> --
>
> Key: MESOS-8454
> URL: https://issues.apache.org/jira/browse/MESOS-8454
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Reporter: Vinod Kone
>Assignee: Armand Grillet
>Priority: Major
>
> Just like task sandboxes, it would be great for us to provide a download link 
> for master and agent logs in the WebUI. Right now the log link opens up 
> the pailer, which is not really convenient for running `grep` and such while 
> debugging.





[jira] [Assigned] (MESOS-8454) Add a download link for master and agent logs in WebUI

2018-01-17 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8454:
-

Assignee: Armand Grillet

> Add a download link for master and agent logs in WebUI
> --
>
> Key: MESOS-8454
> URL: https://issues.apache.org/jira/browse/MESOS-8454
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Reporter: Vinod Kone
>Assignee: Armand Grillet
>Priority: Major
>
> Just like task sandboxes, it would be great for us to provide a download link 
> for master and agent logs in the WebUI. Right now the log link opens up 
> the pailer, which is not really convenient for running `grep` and such while 
> debugging.





[jira] [Updated] (MESOS-8454) Add a download link for master and agent logs in WebUI

2018-01-17 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-8454:
--
Component/s: webui

> Add a download link for master and agent logs in WebUI
> --
>
> Key: MESOS-8454
> URL: https://issues.apache.org/jira/browse/MESOS-8454
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Reporter: Vinod Kone
>Priority: Major
>
> Just like task sandboxes, it would be great for us to provide a download link 
> for master and agent logs in the WebUI. Right now the log link opens up 
> the pailer, which is not really convenient for running `grep` and such while 
> debugging.





[jira] [Created] (MESOS-8454) Add a download link for master and agent logs in WebUI

2018-01-17 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-8454:
-

 Summary: Add a download link for master and agent logs in WebUI
 Key: MESOS-8454
 URL: https://issues.apache.org/jira/browse/MESOS-8454
 Project: Mesos
  Issue Type: Improvement
Reporter: Vinod Kone


Just like task sandboxes, it would be great for us to provide a download link 
for master and agent logs in the WebUI. Right now the log link opens up the 
pailer, which is not really convenient for running `grep` and such while debugging.





[jira] [Updated] (MESOS-7028) NetSocketTest.EOFBeforeRecv is flaky.

2018-01-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7028:
---
Shepherd: Alexander Rukletsov
 Summary: NetSocketTest.EOFBeforeRecv is flaky.  (was: 
NetSocketTest.EOFBeforeRecv is flaky)

> NetSocketTest.EOFBeforeRecv is flaky.
> -
>
> Key: MESOS-7028
> URL: https://issues.apache.org/jira/browse/MESOS-7028
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
> Environment: ASF CI, autotools, gcc, CentOS 7, libevent/SSL enabled;
> Mac OS with SSL enabled;
> CentOS 6 with SSL enabled;
>Reporter: Greg Mann
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: flaky, flaky-test, libprocess, mesosphere, socket, ssl
> Attachments: NetSocketTest.EOFBeforeRecv-vlog3.txt
>
>
> This was observed on ASF CI:
> {code}
> [ RUN  ] Encryption/NetSocketTest.EOFBeforeRecv/0
> I0128 03:48:51.444228 27745 openssl.cpp:419] CA file path is unspecified! 
> NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0128 03:48:51.444252 27745 openssl.cpp:424] CA directory path unspecified! 
> NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0128 03:48:51.444257 27745 openssl.cpp:429] Will not verify peer certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0128 03:48:51.444262 27745 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> I0128 03:48:51.447341 27745 process.cpp:1246] libprocess is initialized on 
> 172.17.0.2:45515 with 16 worker threads
> ../../../3rdparty/libprocess/src/tests/socket_tests.cpp:196: Failure
> Failed to wait 15secs for client->recv()
> [  FAILED  ] Encryption/NetSocketTest.EOFBeforeRecv/0, where GetParam() = 
> "SSL" (15269 ms)
> {code}





[jira] [Commented] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-17 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329006#comment-16329006
 ] 

Andrei Budnik commented on MESOS-7742:
--

These patches ^^ are fixing the first cause described in the [first 
patch|https://reviews.apache.org/r/65122/].

There is a second cause when an attempt to connect to IO-Switchboard fails with:
{code:java}
I1109 23:47:25.016929 27803 process.cpp:3982] Failed to process request for 
'/slave(812)/api/v1': Failed to connect to 
/tmp/mesos-io-switchboard-56bcba4b-6e81-4aeb-a0e9-41309ec991b5: Connection 
refused
W1109 23:47:25.017009 27803 http.cpp:2944] Failed to attach to nested container 
7ab572dd-78b5-4186-93af-7ac011990f80.b77944da-f1d5-4694-a51b-8fde150c5f7a: 
Failed to connect to 
/tmp/mesos-io-switchboard-56bcba4b-6e81-4aeb-a0e9-41309ec991b5: Connection 
refused
I1109 23:47:25.017063 27803 process.cpp:1590] Returning '500 Internal Server 
Error' for '/slave(812)/api/v1' (Failed to connect to 
/tmp/mesos-io-switchboard-56bcba4b-6e81-4aeb-a0e9-41309ec991b5: Connection 
refused)
{code}
The reason for this failure needs to be investigated.

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.5.0
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: flaky-test, mesosphere-oncall
> Fix For: 1.6.0
>
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes in at least three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}





[jira] [Updated] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7742:
---
Affects Version/s: 1.5.0
  Component/s: agent

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.5.0
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: flaky-test, mesosphere-oncall
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes in at least three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}





[jira] [Updated] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7742:
---
Shepherd: Alexander Rukletsov  (was: Vinod Kone)

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: flaky-test, mesosphere-oncall
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes in at least three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}





[jira] [Created] (MESOS-8451) Unhandled Interference between registration and reregistration code paths

2018-01-17 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8451:
--

 Summary: Unhandled Interference between registration and 
reregistration code paths
 Key: MESOS-8451
 URL: https://issues.apache.org/jira/browse/MESOS-8451
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Right now, the code paths for agent registration and agent re-registration run 
independently of each other, probably on the assumption that re-registration 
requires an agent ID from the master, which is only given out after successful 
registration, so the code paths cannot interfere.

However, it is not hard to construct examples where this assumption fails, e.g.

- Agent sends out registration message 1
- Timeout expires, agent sends out registration message 2
- Agent gets the response to registration message 1, updates agent id, is restarted
- Agent sends reregistration message 1 after restart

Most likely, a proper solution will require introducing some kind of counter 
or uuid to the (re-)registration messages, which is also required for proper 
handling of multiple reregistration messages as described in MESOS-8273.
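
As an illustration only, the uuid idea could look roughly like the sketch below. It is not Mesos code: the `Agent` class, its method names, and the message fields are all hypothetical, and it models only the agent side. Each outgoing (re-)registration message carries a fresh token, and a reply is accepted only if its token belongs to the current round of attempts.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Agent:
    """Toy agent (hypothetical) that tags every (re-)registration attempt."""
    agent_id: Optional[str] = None
    outstanding: set = field(default_factory=set)

    def send(self) -> dict:
        # Every attempt gets its own token so replies can be matched to it.
        token = str(uuid.uuid4())
        self.outstanding.add(token)
        kind = "reregister" if self.agent_id else "register"
        return {"kind": kind, "token": token, "agent_id": self.agent_id}

    def restart(self) -> None:
        # Attempts made before the restart are no longer valid; the agent ID
        # is assumed to survive the restart (e.g. via checkpointing).
        self.outstanding.clear()

    def on_reply(self, token: str, agent_id: str) -> bool:
        # Ignore replies to attempts from an earlier round.
        if token not in self.outstanding:
            return False
        self.outstanding.clear()
        self.agent_id = agent_id
        return True
```

Replaying the scenario from the list: the agent sends two registration messages, accepts the reply to the first, restarts, and sends a reregistration message. A late reply to the second registration message then carries a token that is no longer outstanding, so `on_reply` returns `False` and the agent ID is left unchanged.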





[jira] [Commented] (MESOS-8424) Test that operations are correctly reported following a master failover

2018-01-17 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16328763#comment-16328763
 ] 

Benjamin Bannier commented on MESOS-8424:
-

{noformat}
commit f6d7cd6da41b0d43c0923dea35531775850c0b5e
Author: Jan Schlicht 
Date:   Wed Jan 17 13:50:23 2018 +0100

Moved agent response code into 'protobuf_utils.cpp'.

Review: https://reviews.apache.org/r/65043/
{noformat}

> Test that operations are correctly reported following a master failover
> ---
>
> Key: MESOS-8424
> URL: https://issues.apache.org/jira/browse/MESOS-8424
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>Priority: Major
>
> As the master keeps track of operations running on a resource provider, it 
> needs to be updated on these operations when agents reregister after a master 
> failover. E.g., an operation that has finished during the failover should be 
> reported as finished by the master after the agent on which the resource 
> provider is running has reregistered.





[jira] [Created] (MESOS-8450) SlaveInfo comparison is unnecessarily expensive

2018-01-17 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8450:
--

 Summary: SlaveInfo comparison is unnecessarily expensive
 Key: MESOS-8450
 URL: https://issues.apache.org/jira/browse/MESOS-8450
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Currently, the comparison operator of `struct SlaveInfo` creates two 
temporary `Resources` objects and two temporary `Attributes` objects. All of 
these constructors do a bunch of work and allocate memory.

Instead of passing around `SlaveInfo` in the master, we should probably use 
some wrapper that stores the raw message and caches the lazily 
generated `Resources` and `Attributes` objects associated with that `SlaveInfo`.
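
A rough sketch of the wrapper idea, written in Python for brevity rather than in Mesos' C++. The class name `CachedSlaveInfo`, the dict-shaped raw message, and the `parse_count` counter are assumptions made for illustration; the point is only that the derived objects are built once, on first use, and reused by every later comparison.

```python
from functools import cached_property


class CachedSlaveInfo:
    """Hypothetical wrapper: keeps the raw message, caches parsed views."""

    def __init__(self, raw: dict):
        self.raw = raw
        self.parse_count = 0  # Only here to make the caching observable.

    @cached_property
    def resources(self) -> frozenset:
        # Stand-in for the expensive `Resources` construction.
        self.parse_count += 1
        return frozenset(self.raw.get("resources", {}).items())

    @cached_property
    def attributes(self) -> frozenset:
        # Stand-in for the expensive `Attributes` construction.
        self.parse_count += 1
        return frozenset(self.raw.get("attributes", {}).items())

    def __eq__(self, other):
        if not isinstance(other, CachedSlaveInfo):
            return NotImplemented
        # Reuses the cached views instead of rebuilding temporaries.
        return (self.resources == other.resources
                and self.attributes == other.attributes)
```

Comparing the same two wrappers repeatedly parses each raw message at most once per view, whereas the behaviour described above rebuilds all four temporaries on every comparison.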





[jira] [Assigned] (MESOS-8447) Incomplete output of apply-reviews.py --dry-run

2018-01-17 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier reassigned MESOS-8447:
---

Assignee: Benjamin Bannier

> Incomplete output of apply-reviews.py --dry-run
> ---
>
> Key: MESOS-8447
> URL: https://issues.apache.org/jira/browse/MESOS-8447
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>Priority: Major
>
> The script {{support/apply-reviews.py}} has a flag {{--dry-run}} which should 
> dump the commands which would be performed. This flag is useful to e.g., 
> reorder patch chains or to manually resolve intermediate conflicts while 
> still being able to pull a full chain.
> The output looks like this
> {noformat}
> % ./support/apply-reviews.py -r 62447 -c -n -3 --dry-run
> wget --no-check-certificate --no-verbose -O 62160.patch 
> https://reviews.apache.org/r/62160/diff/raw/
> git apply --index 62160.patch --3way
> git commit --author "Benno Evers " -aF "62160.message"
> wget --no-check-certificate --no-verbose -O 62161.patch 
> https://reviews.apache.org/r/62161/diff/raw/
> git apply --index 62161.patch --3way
> git commit --author "Benno Evers " -aF "62161.message"
> wget --no-check-certificate --no-verbose -O 62444.patch 
> https://reviews.apache.org/r/62444/diff/raw/
> git apply --index 62444.patch --3way
> git commit --author "Benno Evers " -aF "62444.message"
> wget --no-check-certificate --no-verbose -O 62445.patch 
> https://reviews.apache.org/r/62445/diff/raw/
> git apply --index 62445.patch --3way
> git commit --author "Benno Evers " -aF "62445.message"
> wget --no-check-certificate --no-verbose -O 62447.patch 
> https://reviews.apache.org/r/62447/diff/raw/
> git apply --index 62447.patch --3way
> git commit --author "Benno Evers " -aF "62447.message"
> {noformat}
> Trying to replay that dry run leads to an error since the commands to create 
> the commit message files are not printed.
> We should add these commands to the output.
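
A minimal sketch of what a replayable dry run could print for one review. This is not the actual code of `support/apply-reviews.py`: the function `dry_run_commands` and the use of `printf` to write the message file are assumptions, and only the `wget`, `git apply`, and `git commit` lines mirror the output quoted above.

```python
import shlex


def dry_run_commands(review_id: int, author: str, message: str) -> list:
    """Return the shell commands a dry run would print for one review."""
    patch = f"{review_id}.patch"
    message_file = f"{review_id}.message"
    url = f"https://reviews.apache.org/r/{review_id}/diff/raw/"
    return [
        f"wget --no-check-certificate --no-verbose -O {patch} {url}",
        # The step missing from the current output: create the message file
        # before `git commit -F` tries to read it.
        f"printf '%s\\n' {shlex.quote(message)} > {message_file}",
        f"git apply --index {patch} --3way",
        f"git commit --author {shlex.quote(author)} "
        f"-aF {shlex.quote(message_file)}",
    ]
```

Printing the message-file command before the `git commit` line is what makes the sequence replayable; `shlex.quote` keeps commit messages containing quotes or spaces intact when the commands are pasted into a shell.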





[jira] [Updated] (MESOS-8314) Add authorization to display of resource provider information in API calls and endpoints

2018-01-17 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-8314:

Description: The {{GET_RESOURCE_PROVIDERS}} call is used to list all 
resource providers known to a Mesos agent. We also display resource provider 
infos for the master's {{GET_AGENTS}} call. These calls need to be authorized.  
(was: The {{GET_RESOURCE_PROVIDERS}} call is used to list all resource 
providers known to a Mesos master or agent. This call needs to be authorized.)

> Add authorization to display of resource provider information in API calls 
> and endpoints
> 
>
> Key: MESOS-8314
> URL: https://issues.apache.org/jira/browse/MESOS-8314
> Project: Mesos
>  Issue Type: Task
>  Components: agent, HTTP API, master
>Reporter: Jan Schlicht
>Priority: Major
>  Labels: csi-post-mvp
>
> The {{GET_RESOURCE_PROVIDERS}} call is used to list all resource providers 
> known to a Mesos agent. We also display resource provider infos for the 
> master's {{GET_AGENTS}} call. These calls need to be authorized.





[jira] [Updated] (MESOS-8314) Add authorization to display of resource provider information in API calls and endpoints

2018-01-17 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-8314:

Component/s: master
 agent

> Add authorization to display of resource provider information in API calls 
> and endpoints
> 
>
> Key: MESOS-8314
> URL: https://issues.apache.org/jira/browse/MESOS-8314
> Project: Mesos
>  Issue Type: Task
>  Components: agent, HTTP API, master
>Reporter: Jan Schlicht
>Priority: Major
>  Labels: csi-post-mvp
>
> The {{GET_RESOURCE_PROVIDERS}} call is used to list all resource providers 
> known to a Mesos master or agent. This call needs to be authorized.


