[jira] [Assigned] (MESOS-6636) Validate that tasks / executors / reservations do not mix Resource.allocation_info.roles.

2016-12-07 Thread Jay Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Guo reassigned MESOS-6636:
--

Assignee: Jay Guo

> Validate that tasks / executors / reservations do not mix 
> Resource.allocation_info.roles.
> -
>
> Key: MESOS-6636
> URL: https://issues.apache.org/jira/browse/MESOS-6636
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Benjamin Mahler
>Assignee: Jay Guo
>
> With support for multi-role frameworks, we need to make sure that individual 
> tasks and executors cannot mix roles. Likewise, we do not want to allow a 
> scheduler to make a reservation based on resources with different allocated 
> roles.
> We will however allow tasks from one role to run on executors from another 
> role.
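Below is a minimal sketch of the per-task, per-executor and per-reservation check described above. It assumes a simplified `Resource` struct that carries only a name and an optional allocation role. The real `Resource.allocation_info` is a protobuf message, and `validateSingleAllocationRole` is a hypothetical helper, not the master's actual validation code. Tasks may still run on executors from another role, so a check like this would be applied to each task's resources and each executor's resources separately, not to their union.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Simplified stand-in for mesos::Resource with only the fields this check
// needs. `role` models Resource.allocation_info.role.
struct Resource {
  std::string name;
  std::optional<std::string> role;
};

// Returns an error message if the resources of a single task, executor, or
// reservation span more than one allocation role, or std::nullopt if valid.
// Treating a missing role as an error is an assumption of this sketch.
std::optional<std::string> validateSingleAllocationRole(
    const std::vector<Resource>& resources)
{
  std::set<std::string> roles;
  for (const Resource& resource : resources) {
    if (!resource.role.has_value()) {
      return "Resource '" + resource.name + "' has no allocation role";
    }
    roles.insert(*resource.role);
  }

  if (roles.size() > 1) {
    return "Resources must not mix allocation roles";
  }

  return std::nullopt;
}
```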



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-6637) Validate that schedulers cannot perform operations on offers with different allocation roles.

2016-12-07 Thread Jay Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Guo reassigned MESOS-6637:
--

Assignee: Jay Guo

> Validate that schedulers cannot perform operations on offers with different 
> allocation roles.
> -
>
> Key: MESOS-6637
> URL: https://issues.apache.org/jira/browse/MESOS-6637
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Benjamin Mahler
>Assignee: Jay Guo
>
> With support for multi-role frameworks, offers contain allocation info 
> (currently just the role that the offer is being made to).
> In theory, schedulers could perform offer operations across multiple roles, 
> so long as the tasks, executors, and reservations individually don't mix 
> roles. However, there doesn't seem to be a clear reason to allow this. So, we 
> will validate against combining offers from multiple roles. This also makes 
> it semantically consistent with single-role frameworks (since they do not do 
> this either).
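The offer-level check can be sketched the same way: every offer used in one operation must carry the same allocation role. In this sketch, `Offer` is a hypothetical two-field stand-in for the real protobuf, and `offersShareAllocationRole` is an illustrative name only.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for mesos::Offer: an ID plus the role the offer was
// allocated to (Offer.allocation_info.role in the multi-role design).
struct Offer {
  std::string id;
  std::string role;
};

// Returns true if every offer passed to a single accept call was made to the
// same role. An empty list is trivially valid.
bool offersShareAllocationRole(const std::vector<Offer>& offers)
{
  for (const Offer& offer : offers) {
    if (offer.role != offers.front().role) {
      return false;
    }
  }
  return true;
}
```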





[jira] [Updated] (MESOS-6742) Adding support for s390x architecture

2016-12-07 Thread Ayanampudi Varsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayanampudi Varsha updated MESOS-6742:
-
Description: 
There are 2 issues:
1. LdcacheTest.Parse test case fails on s390x machines.
2. Because of the value of the docker_registry flag in slave/flags.cpp, amd64 images 
get downloaded, which causes test cases to fail on s390x with "Exec format Error".

  was:
There are 2 issues:
1. LdcacheTest.Parse test case fails on s390x machines.
2. Because of the value of the docker_registry flag in slave.cpp, amd64 images get 
downloaded, which causes test cases to fail on s390x with "Exec format Error".


> Adding support for s390x architecture 
> --
>
> Key: MESOS-6742
> URL: https://issues.apache.org/jira/browse/MESOS-6742
> Project: Mesos
>  Issue Type: Bug
>Reporter: Ayanampudi Varsha
>
> There are 2 issues:
> 1. LdcacheTest.Parse test case fails on s390x machines.
> 2. Because of the value of the docker_registry flag in slave/flags.cpp, amd64 
> images get downloaded, which causes test cases to fail on s390x with "Exec format Error".





[jira] [Commented] (MESOS-6695) Light up Windows agent tests

2016-12-07 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15730900#comment-15730900
 ] 

Joseph Wu commented on MESOS-6695:
--

{code}
commit 1648491e2f194f5ba9d62cb1e099066fb7f16272
Author: Alex Clemmer 
Date:   Wed Dec 7 16:13:22 2016 -0800

Changed registrar backend to `in_memory` by default in tests.

Currently, all instances of the Master in tests set the
`--registry` flag to the default value (`replicated_log`).
When the `replicated_log` value is set, Masters in tests will
back the registrar with the disk, specifically via levelDB.

Only a small subset of tests actually require the `replicated_log`;
these are tests which expect the master to persist data across
failovers.  A majority of tests can be run with an `in_memory`
registrar backend.  Changing the default to `in_memory` will
serve multiple purposes:
* It will speed up the test suite by ~10-15%.
* It will reduce the flakiness observed on the ASF CI.
  These machines sometimes run into disk contention, which causes
  registrar reads/writes to time out.
* It will unblock a majority of tests from being run on Windows,
  which currently does not implement a persistent registrar backend.

This review supersedes and revives: https://reviews.apache.org/r/41665/

Review: https://reviews.apache.org/r/54453/
{code}
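Here is a sketch of the test-side default described in the commit message. It assumes a trimmed-down `MasterFlags` struct. `createTestMasterFlags` and its `persistAcrossFailover` parameter are hypothetical names, not the actual Mesos test harness API. Only tests that expect data to survive a master failover opt back into `replicated_log`.

```cpp
#include <cassert>
#include <string>

// Hypothetical, trimmed-down view of the master flags seen by tests.
struct MasterFlags {
  std::string registry = "replicated_log";  // Production default.
};

// Hypothetical test helper: tests get an in-memory registrar unless they
// explicitly need registry state to persist across a master failover.
MasterFlags createTestMasterFlags(bool persistAcrossFailover = false)
{
  MasterFlags flags;
  flags.registry = persistAcrossFailover ? "replicated_log" : "in_memory";
  return flags;
}
```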

> Light up Windows agent tests
> 
>
> Key: MESOS-6695
> URL: https://issues.apache.org/jira/browse/MESOS-6695
> Project: Mesos
>  Issue Type: Epic
>Reporter: Alex Clemmer
>Assignee: Alex Clemmer
>






[jira] [Commented] (MESOS-3447) Port svn_tests

2016-12-07 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15730898#comment-15730898
 ] 

Joseph Wu commented on MESOS-3447:
--

{code}
commit b5b1ead3a8c28b8b65f514fd7b030324a735a26d
Author: Alex Clemmer 
Date:   Wed Dec 7 16:57:37 2016 -0800

Windows: Added APR include path to libprocess configuration.

Partially addresses MESOS-3447, as APR is a dependency of the SVN
facilities of Stout.

On Unix builds, APR is expected to have been installed on the system
prior to building Mesos (usually by a package manager). Since Windows
does not have a package manager or a reasonable way of automatically
discovering where a package is installed (aside from the registry), our
CMake build system takes it upon itself to manage these system
dependencies.  This means that on Windows, we need to configure the
build to look for the APR headers in our custom-downloaded APR
repository.  Currently, though, we are not doing this, so we'll
hit a compile-time error if we try to build (e.g.) `svn.hpp`.

This commit will introduce the APR include paths as part of the build
against Stout. Since Stout is a header-only library, it is (right now)
incumbent on whoever is bundling Stout up to manage the third-party
dependencies of Stout. In our current implementation, libprocess manages
the APR dependency for Stout, hence, we put this logic in libprocess.

Review: https://reviews.apache.org/r/54462/
{code}

> Port svn_tests
> --
>
> Key: MESOS-3447
> URL: https://issues.apache.org/jira/browse/MESOS-3447
> Project: Mesos
>  Issue Type: Task
>  Components: stout
>Reporter: Alex Clemmer
>Assignee: Alex Clemmer
>  Labels: mesosphere, stout
>
> Should be trivial if we have libapr and libsvn building and linking correctly.





[jira] [Commented] (MESOS-6717) Add Windows support to agent test harness

2016-12-07 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15730897#comment-15730897
 ] 

Joseph Wu commented on MESOS-6717:
--

{code}
commit 4c0e453296e3ac7c5eda48a98eb7ad570c303d0a
Author: Alex Clemmer 
Date:   Wed Dec 7 17:21:01 2016 -0800

Windows: Fixed default isolators in Agent.

This commit sets the default isolators on Windows to Windows-specific
values, rather than POSIX-specific values.  This is a convenience for
users on Windows (who no longer need to specify
`--isolation=windows/cpu,filesystem/windows`) and will allow tests
to exercise the default set of Agent flags.

In particular, this commit will transition Windows builds of the agent
away from using the `posix/cpu`, `posix/mem`, and `filesystem/posix`
isolators by default, replacing them with `windows/cpu` and
`filesystem/windows` (sadly, there is not yet a memory isolator for
Windows).

Review: https://reviews.apache.org/r/54470/
{code}
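The platform switch described in the commit can be sketched as a compile-time choice. This sketch assumes the standard `_WIN32` macro and a hypothetical `defaultIsolation()` helper. The isolator names come from the commit message, and the real flag plumbing in the agent is not shown.

```cpp
#include <cassert>
#include <string>

// Returns the default isolators for the platform this is compiled on, using
// the isolator lists named in the commit message.
std::string defaultIsolation()
{
#ifdef _WIN32
  // There is not yet a memory isolator for Windows.
  return "windows/cpu,filesystem/windows";
#else
  return "posix/cpu,posix/mem,filesystem/posix";
#endif
}
```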

> Add Windows support to agent test harness
> -
>
> Key: MESOS-6717
> URL: https://issues.apache.org/jira/browse/MESOS-6717
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Alex Clemmer
>Assignee: Alex Clemmer
>  Labels: microsoft, windows-mvp
>
> Of particular interest in `src/tests/CMakeLists.txt` is supporting enough of 
> the following that we can successfully run agent tests:
> TEST_HELPER_SRC
> MESOS_TESTS_UTILS_SRC





[jira] [Updated] (MESOS-6756) I/O switchboard should deal with the case when reaping of the server failed.

2016-12-07 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-6756:
--
Story Points: 3

> I/O switchboard should deal with the case when reaping of the server failed.
> 
>
> Key: MESOS-6756
> URL: https://issues.apache.org/jira/browse/MESOS-6756
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Assignee: Jie Yu
> Fix For: 1.2.0
>
>
> Currently, we don't deal with the reaping failure, which we should.





[jira] [Commented] (MESOS-6750) Metrics on the Agent view of the Mesos web UI flickers between empty and non-empty states

2016-12-07 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15730835#comment-15730835
 ] 

haosdent commented on MESOS-6750:
-

Thanks a lot! Let me verify your patch :- )

> Metrics on the Agent view of the Mesos web UI flickers between empty and 
> non-empty states
> -
>
> Key: MESOS-6750
> URL: https://issues.apache.org/jira/browse/MESOS-6750
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Joseph Wu
>Assignee: haosdent
>Priority: Minor
> Attachments: patch.diff
>
>
> When viewing a specific agent on the Mesos WebUI, the metrics panel on the 
> left side of the UI will alternate between having values and being empty.
> This is due to two different callbacks that run:
> * This one sets the metrics into the {{$scope.state}} variable: 
> https://github.com/apache/mesos/blob/1.1.x/src/webui/master/static/js/controllers.js#L564-L577
> * This one blows away the {{$scope.state}} in favor of a new one: 
> https://github.com/apache/mesos/blob/1.1.x/src/webui/master/static/js/controllers.js#L521
> The metrics callback should simply assign to a different variable.





[jira] [Created] (MESOS-6757) Consider using CMake to configure test scripts in the `bin/` directory

2016-12-07 Thread Alex Clemmer (JIRA)
Alex Clemmer created MESOS-6757:
---

 Summary: Consider using CMake to configure test scripts in the 
`bin/` directory
 Key: MESOS-6757
 URL: https://issues.apache.org/jira/browse/MESOS-6757
 Project: Mesos
  Issue Type: Bug
  Components: cmake
Reporter: Alex Clemmer
Assignee: Alex Clemmer








[jira] [Created] (MESOS-6756) I/O switchboard should deal with the case when reaping of the server failed.

2016-12-07 Thread Jie Yu (JIRA)
Jie Yu created MESOS-6756:
-

 Summary: I/O switchboard should deal with the case when reaping of 
the server failed.
 Key: MESOS-6756
 URL: https://issues.apache.org/jira/browse/MESOS-6756
 Project: Mesos
  Issue Type: Bug
Reporter: Jie Yu
Assignee: Jie Yu


Currently, we don't deal with the reaping failure, which we should.





[jira] [Commented] (MESOS-5646) Build `network/cni` isolator with `libnl` support

2016-12-07 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15730703#comment-15730703
 ] 

Qian Zhang commented on MESOS-5646:
---

[~avin...@mesosphere.io], I am not working on it now, please feel free to take 
it over :-)

> Build `network/cni` isolator with `libnl` support
> -
>
> Key: MESOS-5646
> URL: https://issues.apache.org/jira/browse/MESOS-5646
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Affects Versions: 1.0.0
> Environment: linux
>Reporter: Avinash Sridharan
>Assignee: Qian Zhang
>  Labels: mesosphere
>
> Currently, the `network/cni` isolator does not have the ability to collect 
> network statistics for containers launched on a CNI network. We need to give 
> the `network/cni` isolator the ability to query interfaces, route tables and 
> statistics in the container's network namespace. To achieve this, the 
> `network/cni` isolator will need to talk `netlink`.
> For enabling `netlink` API we need the `network/cni` isolator to be built 
> with libnl support. 





[jira] [Updated] (MESOS-6686) Add comments about the meanings of TaskStatus.Reason in mesos.proto

2016-12-07 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-6686:
-
 Labels: documentation newbie  (was: )
Component/s: documentation

> Add comments about the meanings of TaskStatus.Reason in mesos.proto
> ---
>
> Key: MESOS-6686
> URL: https://issues.apache.org/jira/browse/MESOS-6686
> Project: Mesos
>  Issue Type: Improvement
>  Components: documentation
>Reporter: haosdent
>Priority: Minor
>  Labels: documentation, newbie
>
> Some values of the {{TaskStatus.Reason}} enum are not clear; we should add 
> comments describing what each one means and when it occurs.





[jira] [Commented] (MESOS-5646) Build `network/cni` isolator with `libnl` support

2016-12-07 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15730479#comment-15730479
 ] 

Avinash Sridharan commented on MESOS-5646:
--

Hi Qian,
 Are you still working on the review? I see that [~jieyu] has left a comment, but 
there hasn't been much progress on the review since. I want to finish this ticket 
so that we can make progress on support for network statistics. If you don't 
have cycles, I will take it over.

Thanks,
Avinash

> Build `network/cni` isolator with `libnl` support
> -
>
> Key: MESOS-5646
> URL: https://issues.apache.org/jira/browse/MESOS-5646
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Affects Versions: 1.0.0
> Environment: linux
>Reporter: Avinash Sridharan
>Assignee: Qian Zhang
>  Labels: mesosphere
>
> Currently, the `network/cni` isolator does not have the ability to collect 
> network statistics for containers launched on a CNI network. We need to give 
> the `network/cni` isolator the ability to query interfaces, route tables and 
> statistics in the container's network namespace. To achieve this, the 
> `network/cni` isolator will need to talk `netlink`.
> For enabling `netlink` API we need the `network/cni` isolator to be built 
> with libnl support. 





[jira] [Commented] (MESOS-5533) Agent fails to start on CentOS 6 due to missing cgroup hierarchy.

2016-12-07 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15730467#comment-15730467
 ] 

Avinash Sridharan commented on MESOS-5533:
--

[~karya], can we mark this as "Resolved" / "Not reproducible"? We haven't seen 
this being hit in the CI for quite some time, and we don't have any data to make 
progress on it.

> Agent fails to start on CentOS 6 due to missing cgroup hierarchy.
> -
>
> Key: MESOS-5533
> URL: https://issues.apache.org/jira/browse/MESOS-5533
> Project: Mesos
>  Issue Type: Bug
>  Components: build, isolation
>Reporter: Kapil Arya
>Assignee: Jie Yu
>  Labels: mesosphere
>
> With the network CNI isolator, the agent now _requires_ cgroups to be installed 
> on the system. Can we add some check(s) to either automatically disable the CNI 
> module if cgroup hierarchies are not available, or ask the user to 
> install/enable cgroup hierarchies?
> On CentOS 6, cgroup tools aren't installed by default.
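Here is one possible shape for the requested check. It assumes the agent can read `/proc/mounts`, whose lines have the form `<device> <mount point> <fs type> <options> <dump> <pass>`. `hasCgroupMount` is a hypothetical helper. What the agent should do when it returns false (disable the CNI module, or ask the user to enable cgroups) is the open question in this ticket, so that part is not shown.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Returns true if the given /proc/mounts contents list at least one cgroup
// (v1) or cgroup2 filesystem mount.
bool hasCgroupMount(const std::string& procMounts)
{
  std::istringstream lines(procMounts);
  std::string line;
  while (std::getline(lines, line)) {
    std::istringstream fields(line);
    std::string device, mountPoint, type;
    if (fields >> device >> mountPoint >> type &&
        (type == "cgroup" || type == "cgroup2")) {
      return true;
    }
  }
  return false;
}
```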





[jira] [Updated] (MESOS-5647) Expose network statistics for containers on CNI network in the `network/cni` isolator.

2016-12-07 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan updated MESOS-5647:
-
Summary: Expose network statistics for containers on CNI network in the 
`network/cni` isolator.  (was: Expose a network statistics in the `network/cni` 
isolator.)

> Expose network statistics for containers on CNI network in the `network/cni` 
> isolator.
> --
>
> Key: MESOS-5647
> URL: https://issues.apache.org/jira/browse/MESOS-5647
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Affects Versions: 1.0.0
> Environment: linux
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>  Labels: mesosphere
>
> We need to implement the `usage` method in the `network/cni` isolator to 
> expose metrics relating to a container's network traffic.
> On receiving a request for getting `usage` for a given container, the 
> `network/cni` isolator could use NETLINK system calls to query the kernel for 
> interface and routing statistics for a given container's network namespace.
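The ticket proposes NETLINK queries. A simpler alternative technique is to read the per-interface counters for a container's network namespace from `/proc/<pid>/net/dev`, where `<pid>` is any process inside that namespace. The sketch below only parses that file's text. `parseNetDev` and `InterfaceStatistics` are hypothetical names, and mapping the result onto the isolator's `usage` output is not shown.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

struct InterfaceStatistics {
  uint64_t rxBytes = 0;
  uint64_t rxPackets = 0;
  uint64_t txBytes = 0;
  uint64_t txPackets = 0;
};

// Parses the contents of /proc/<pid>/net/dev, which reports per-interface
// counters for the network namespace that <pid> belongs to.
std::map<std::string, InterfaceStatistics> parseNetDev(const std::string& text)
{
  std::map<std::string, InterfaceStatistics> result;
  std::istringstream lines(text);
  std::string line;

  while (std::getline(lines, line)) {
    const std::string::size_type colon = line.find(':');
    if (colon == std::string::npos) {
      continue;  // The two header lines contain '|' but no ':'.
    }

    std::istringstream name(line.substr(0, colon));
    std::string interface;
    name >> interface;  // Strips the leading padding.

    // Receive: bytes packets errs drop fifo frame compressed multicast,
    // followed by Transmit: bytes packets ...
    std::istringstream fields(line.substr(colon + 1));
    uint64_t values[10] = {};
    for (uint64_t& value : values) {
      fields >> value;
    }
    if (!fields) {
      continue;  // Malformed or truncated line.
    }

    result[interface] = {values[0], values[1], values[8], values[9]};
  }

  return result;
}
```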





[jira] [Created] (MESOS-6755) Capturing various tickets for improving CNI support for `MesosContainerizer`

2016-12-07 Thread Avinash Sridharan (JIRA)
Avinash Sridharan created MESOS-6755:


 Summary: Capturing various tickets for improving CNI support for 
`MesosContainerizer`
 Key: MESOS-6755
 URL: https://issues.apache.org/jira/browse/MESOS-6755
 Project: Mesos
  Issue Type: Epic
  Components: containerization
Reporter: Avinash Sridharan
Assignee: Avinash Sridharan


This is a ticket to capture the ongoing effort to improve CNI support for 
`MesosContainerizer`.





[jira] [Updated] (MESOS-5647) Expose a network statistics in the `network/cni` isolator.

2016-12-07 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan updated MESOS-5647:
-
Description: 
We need to implement the `usage` method in the `network/cni` isolator to expose 
metrics relating to a container's network traffic.

On receiving a request for getting `usage` for a given container, the 
`network/cni` isolator could use NETLINK system calls to query the kernel for 
interface and routing statistics for a given container's network namespace.

  was:
We need a statistics endpoint in the `network/cni` isolator to expose metrics 
relating to a container's network traffic.

On receiving a request for a given container the `network/cni` isolator could 
use NETLINK system calls to query the kernel for interface and routing 
statistics for a given container's network namespace.


> Expose a network statistics in the `network/cni` isolator.
> --
>
> Key: MESOS-5647
> URL: https://issues.apache.org/jira/browse/MESOS-5647
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Affects Versions: 1.0.0
> Environment: linux
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>  Labels: mesosphere
>
> We need to implement the `usage` method in the `network/cni` isolator to 
> expose metrics relating to a container's network traffic.
> On receiving a request for getting `usage` for a given container, the 
> `network/cni` isolator could use NETLINK system calls to query the kernel for 
> interface and routing statistics for a given container's network namespace.





[jira] [Assigned] (MESOS-5647) Expose a network statistics in the `network/cni` isolator.

2016-12-07 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan reassigned MESOS-5647:


Assignee: Avinash Sridharan  (was: Qian Zhang)

> Expose a network statistics in the `network/cni` isolator.
> --
>
> Key: MESOS-5647
> URL: https://issues.apache.org/jira/browse/MESOS-5647
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Affects Versions: 1.0.0
> Environment: linux
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>  Labels: mesosphere
>
> We need a statistics endpoint in the `network/cni` isolator to expose metrics 
> relating to a container's network traffic.
> On receiving a request for a given container the `network/cni` isolator could 
> use NETLINK system calls to query the kernel for interface and routing 
> statistics for a given container's network namespace.





[jira] [Updated] (MESOS-5647) Expose a network statistics in the `network/cni` isolator.

2016-12-07 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan updated MESOS-5647:
-
Summary: Expose a network statistics in the `network/cni` isolator.  (was: 
Expose a statistics endpoint on the `network/cni` isolator.)

> Expose a network statistics in the `network/cni` isolator.
> --
>
> Key: MESOS-5647
> URL: https://issues.apache.org/jira/browse/MESOS-5647
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Affects Versions: 1.0.0
> Environment: linux
>Reporter: Avinash Sridharan
>Assignee: Qian Zhang
>  Labels: mesosphere
>
> We need a statistics endpoint in the `network/cni` isolator to expose metrics 
> relating to a container's network traffic.
> On receiving a request for a given container the `network/cni` isolator could 
> use NETLINK system calls to query the kernel for interface and routing 
> statistics for a given container's network namespace.





[jira] [Assigned] (MESOS-6567) Actively Scan for CNI Configurations

2016-12-07 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan reassigned MESOS-6567:


Assignee: Avinash Sridharan

> Actively Scan for CNI Configurations
> 
>
> Key: MESOS-6567
> URL: https://issues.apache.org/jira/browse/MESOS-6567
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Dan Osborne
>Assignee: Avinash Sridharan
>
> Mesos-Agent currently loads the CNI configs into memory at startup. After 
> this point, new configurations that are added will remain unknown to the 
> Mesos Agent process until it is restarted.
> This ticket is to request that the Mesos Agent process scan the CNI config 
> directory each time it is networking a task, so that modifying, adding, and 
> removing networks will not require an agent restart.
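Here is a sketch of rescanning on demand, assuming C++17 `std::filesystem`. `loadCniConfigs` is a hypothetical helper meant to be called each time a container is networked. That way, configurations added, modified or removed after agent startup are picked up. The `.conf`/`.json` extension filter is an assumption, and parsing and validating each network config is not shown.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <map>
#include <sstream>
#include <string>

namespace fs = std::filesystem;

// Re-reads every CNI network configuration file currently present in
// `configDir` and returns file name -> raw file contents. Only ".conf" and
// ".json" files are considered; that extension set is an assumption.
std::map<std::string, std::string> loadCniConfigs(const fs::path& configDir)
{
  std::map<std::string, std::string> configs;

  for (const fs::directory_entry& entry : fs::directory_iterator(configDir)) {
    const std::string extension = entry.path().extension().string();
    if (!entry.is_regular_file() ||
        (extension != ".conf" && extension != ".json")) {
      continue;
    }

    std::ifstream file(entry.path());
    std::ostringstream contents;
    contents << file.rdbuf();
    configs[entry.path().filename().string()] = contents.str();
  }

  return configs;
}
```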





[jira] [Created] (MESOS-6754) Include command in task's state.json entry

2016-12-07 Thread Michael Gummelt (JIRA)
Michael Gummelt created MESOS-6754:
--

 Summary: Include command in task's state.json entry
 Key: MESOS-6754
 URL: https://issues.apache.org/jira/browse/MESOS-6754
 Project: Mesos
  Issue Type: Improvement
  Components: master
Reporter: Michael Gummelt


I would often like to determine which command a task is running without having to 
SSH into the box and run {{ps}}. I'm currently doing this for HDFS, for example.





[jira] [Commented] (MESOS-6665) io::redirect might cause stack overflow.

2016-12-07 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15730198#comment-15730198
 ] 

Adam B commented on MESOS-6665:
---

Any update [~benjaminhindman]?

> io::redirect might cause stack overflow.
> 
>
> Key: MESOS-6665
> URL: https://issues.apache.org/jira/browse/MESOS-6665
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Assignee: Benjamin Hindman
>
> Can reproduce this on macOS sierra:
> {noformat}
> [--] 6 tests from IOTest
> [ RUN  ] IOTest.Poll
> [   OK ] IOTest.Poll (0 ms)
> [ RUN  ] IOTest.Read
> [   OK ] IOTest.Read (3 ms)
> [ RUN  ] IOTest.BufferedRead
> [   OK ] IOTest.BufferedRead (5 ms)
> [ RUN  ] IOTest.Write
> [   OK ] IOTest.Write (1 ms)
> [ RUN  ] IOTest.Redirect
> make[6]: *** [check-local] Illegal instruction: 4
> make[5]: *** [check-am] Error 2
> make[4]: *** [check-recursive] Error 1
> make[3]: *** [check] Error 2
> make[2]: *** [check-recursive] Error 1
> make[1]: *** [check] Error 2
> make: *** [check-recursive] Error 1
> (reverse-i-search)`k': make check -j3
> Jies-MacBook-Pro:build jie$ lldb 3rdparty/libprocess/libprocess-tests
> (lldb) target create "3rdparty/libprocess/libprocess-tests"
> Current executable set to '3rdparty/libprocess/libprocess-tests' (x86_64).
> (lldb) run --gtest_filter=IOTest.Redirect
> Process 26064 launched: 
> '/Users/jie/workspace/dist/mesos/build/3rdparty/libprocess/libprocess-tests' 
> (x86_64)
> Note: Google Test filter = IOTest.Redirect
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from IOTest
> [ RUN  ] IOTest.Redirect
> Process 26064 stopped
> * thread #2: tid = 0x152c5c, 0x7fffd6d463e0 
> libsystem_malloc.dylib`szone_malloc_should_clear + 78, stop reason = 
> EXC_BAD_ACCESS (code=2, address=0x7eb16ff8)
> frame #0: 0x7fffd6d463e0 
> libsystem_malloc.dylib`szone_malloc_should_clear + 78
> libsystem_malloc.dylib`szone_malloc_should_clear:
> ->  0x7fffd6d463e0 <+78>: movq   %rax, -0x78(%rbp)
> 0x7fffd6d463e4 <+82>: movq   0x10f0(%r12), %r13
> 0x7fffd6d463ec <+90>: leaq   (%rax,%rax,4), %r14
> 0x7fffd6d463f0 <+94>: shlq   $0x9, %r14
> (lldb) bt
> .
> frame #2794: 0x7fffd6ddb221 libsystem_pthread.dylib`thread_start + 13
> {noformat}
> Changing the test to redirect just 1KB of data hides the issue.





[jira] [Updated] (MESOS-6753) Refactor duplicated code for framework registration in master

2016-12-07 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-6753:
---
Description: 
It would be nice to refactor the code and eliminate some/all of these 
redundancies:

* {{Master::activateRecoveredFramework}} and {{Master::\_failoverFramework}} 
duplicate some code.
* {{Master::\_subscribe}} for PID-based schedulers has a code path that 
contains code that is _very_ similar to the {{Master::failoverFramework}} 
logic, but is not identical.
* The logic around {{updateConnection}} could stand to be cleaned up. e.g., it 
seems like {{updateConnection}} could/should be responsible for linking to the 
target PID and/or setting up the {{closed}} callback (for PID or HTTP 
schedulers, respectively).

  was:
It would be nice to refactor the code and eliminate some/all of these 
redundancies:

* {{Master::activateRecoveredFramework}} and {{Master::_failoverFramework}} 
duplicate some code.
* {{Master::_subscribe}} for PID-based schedulers has a code path that contains 
code that is _very_ similar to the {{Master::failoverFramework}} logic, but is 
not identical.
* The logic around {{updateConnection}} could stand to be cleaned up. e.g., it 
seems like {{updateConnection}} could/should be responsible for linking to the 
target PID and/or setting up the {{closed}} callback (for PID or HTTP 
schedulers, respectively).


> Refactor duplicated code for framework registration in master
> -
>
> Key: MESOS-6753
> URL: https://issues.apache.org/jira/browse/MESOS-6753
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Neil Conway
>  Labels: mesosphere
>
> It would be nice to refactor the code and eliminate some/all of these 
> redundancies:
> * {{Master::activateRecoveredFramework}} and {{Master::\_failoverFramework}} 
> duplicate some code.
> * {{Master::\_subscribe}} for PID-based schedulers has a code path that 
> contains code that is _very_ similar to the {{Master::failoverFramework}} 
> logic, but is not identical.
> * The logic around {{updateConnection}} could stand to be cleaned up. e.g., 
> it seems like {{updateConnection}} could/should be responsible for linking to 
> the target PID and/or setting up the {{closed}} callback (for PID or HTTP 
> schedulers, respectively).





[jira] [Created] (MESOS-6753) Refactor duplicated code for framework registration in master

2016-12-07 Thread Neil Conway (JIRA)
Neil Conway created MESOS-6753:
--

 Summary: Refactor duplicated code for framework registration in 
master
 Key: MESOS-6753
 URL: https://issues.apache.org/jira/browse/MESOS-6753
 Project: Mesos
  Issue Type: Bug
  Components: master
Reporter: Neil Conway


It would be nice to refactor the code and eliminate some/all of these 
redundancies:

* {{Master::activateRecoveredFramework}} and {{Master::_failoverFramework}} 
duplicate some code.
* {{Master::_subscribe}} for PID-based schedulers has a code path that contains 
code that is _very_ similar to the {{Master::failoverFramework}} logic, but is 
not identical.
* The logic around {{updateConnection}} could stand to be cleaned up. e.g., it 
seems like {{updateConnection}} could/should be responsible for linking to the 
target PID and/or setting up the {{closed}} callback (for PID or HTTP 
schedulers, respectively).





[jira] [Created] (MESOS-6752) Add a `post()` overload to libprocess for streaming requests

2016-12-07 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-6752:
-

 Summary: Add a `post()` overload to libprocess for streaming 
requests
 Key: MESOS-6752
 URL: https://issues.apache.org/jira/browse/MESOS-6752
 Project: Mesos
  Issue Type: Improvement
  Components: HTTP API, libprocess
Reporter: Anand Mazumdar


Currently, the {{post}}/{{streaming::post}} overloads in [libprocess | 
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/http.hpp]
 don't work for streaming requests. The {{streaming::post}} overload works only 
for streaming responses. We should add another overload to handle streaming 
requests.
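A streaming request body has no known length up front. Over HTTP/1.1, such a body would normally be framed with `Transfer-Encoding: chunked` (RFC 7230, section 4.1). The sketch below shows only that wire framing. It is not the libprocess `post()` overload this ticket asks for, and `encodeChunk`/`encodeChunkedBody` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Frames one piece of a streaming request body using HTTP/1.1 chunked
// transfer coding: size in hex, CRLF, data, CRLF.
std::string encodeChunk(const std::string& data)
{
  char size[32];
  std::snprintf(size, sizeof(size), "%zx", data.size());
  return std::string(size) + "\r\n" + data + "\r\n";
}

// Frames a whole sequence of body pieces and terminates the body with the
// zero-length last-chunk followed by an empty trailer section.
std::string encodeChunkedBody(const std::vector<std::string>& pieces)
{
  std::string body;
  for (const std::string& piece : pieces) {
    if (!piece.empty()) {  // A zero-length chunk would end the body early.
      body += encodeChunk(piece);
    }
  }
  return body + "0\r\n\r\n";
}
```

A streaming client would write each `encodeChunk(piece)` as soon as that piece becomes available, and finish with the terminating `0\r\n\r\n`.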





[jira] [Updated] (MESOS-6746) IOSwitchboard doesn't properly flush data on ATTACH_CONTAINER_OUTPUT

2016-12-07 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6746:
--
Shepherd: Vinod Kone
  Sprint: Mesosphere Sprint 47

> IOSwitchboard doesn't properly flush data on ATTACH_CONTAINER_OUTPUT
> 
>
> Key: MESOS-6746
> URL: https://issues.apache.org/jira/browse/MESOS-6746
> Project: Mesos
>  Issue Type: Bug
>Reporter: Kevin Klues
>Assignee: Anand Mazumdar
>  Labels: debugging, mesosphere
> Fix For: 1.2.0
>
>
> Currently we close the write end of all connection pipes when we exit the 
> switchboard, but we don't wait for the reader to be flushed before exiting. 
> Because the process may exit first, some data can get dropped. The current 
> code is:
> {noformat}
> void IOSwitchboardServerProcess::finalize()
> {
>   foreach (HttpConnection& connection, outputConnections) {
>     connection.close();
>   }
>
>   if (failure.isSome()) {
>     promise.fail(failure->message);
>   } else {
>     promise.set(Nothing());
>   }
> }
> {noformat}
> We should change it to:
> {noformat}
> void IOSwitchboardServerProcess::finalize()   
> { 
>   foreach (HttpConnection& connection, outputConnections) {   
> connection.close();
> connection.closed().await();  
>   }   
>   
>   if (failure.isSome()) {
> promise.fail(failure->message);   
>   } else {
> promise.set(Nothing());   
>   }   
> } 
> {noformat}





[jira] [Assigned] (MESOS-6635) Update allocator to handle multi-role frameworks.

2016-12-07 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-6635:
--

Assignee: Benjamin Mahler  (was: Jay Guo)

> Update allocator to handle multi-role frameworks.
> -
>
> Key: MESOS-6635
> URL: https://issues.apache.org/jira/browse/MESOS-6635
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>
> The allocator needs to be adjusted once we allow frameworks to have multiple 
> roles:
> (1) When adding a framework, we need to store all of its roles and add it to 
> multiple role sorters.
> (2) We will CHECK that the framework does not modify its roles when updating 
> the framework (much like we do for single-role frameworks).
> (3) When performing an allocation, the allocator will set 
> allocation_info.role. When recovering resources, the allocator will unset 
> allocation_info.role.
> (4) The allocator will send AllocationInfo alongside offers that it sends to 
> the master, so that the master can easily augment {{Offer}} with allocation 
> info.





[jira] [Created] (MESOS-6751) Mesos should allow for selective environment inheritance.

2016-12-07 Thread Till Toenshoff (JIRA)
Till Toenshoff created MESOS-6751:
-

 Summary: Mesos should allow for selective environment inheritance.
 Key: MESOS-6751
 URL: https://issues.apache.org/jira/browse/MESOS-6751
 Project: Mesos
  Issue Type: Improvement
Reporter: Till Toenshoff


We have often run into issues with environment variables inherited by 
subprocesses which in certain setups cause problems.
VERY recent examples are: 
- MESOS-6747
- MESOS-6748

A pattern for selective inheritance is relatively simple. It covers the basics 
like PATH, LD_LIBRARY_PATH and DYLD_LIBRARY_PATH, while carving out traps like 
LIBPROCESS_-related (and maybe also MESOS_-related) variables:

{noformat}
  map<string, string> environment;
  foreachpair (const string& key, const string& value, os::environment()) {
if (!strings::startsWith(key, "LIBPROCESS_") &&
!strings::startsWith(key, "MESOS_")) {
  environment.emplace(key, value);
}
  }
{noformat}

But maybe we can somehow enforce the use of such a pattern to make this kind of 
bug less frequent in new code that forks.





[jira] [Updated] (MESOS-6750) Metrics on the Agent view of the Mesos web UI flickers between empty and non-empty states

2016-12-07 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-6750:
-
Attachment: patch.diff

Attached a diff to show the parts of the JS that will probably be affected.

> Metrics on the Agent view of the Mesos web UI flickers between empty and 
> non-empty states
> -
>
> Key: MESOS-6750
> URL: https://issues.apache.org/jira/browse/MESOS-6750
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Joseph Wu
>Assignee: haosdent
>Priority: Minor
> Attachments: patch.diff
>
>
> When viewing a specific agent on the Mesos WebUI, the metrics panel on the 
> left side of the UI will alternate between having values and being empty.
> This is due to two different callbacks that run:
> * This one sets the metrics into the {{$scope.state}} variable: 
> https://github.com/apache/mesos/blob/1.1.x/src/webui/master/static/js/controllers.js#L564-L577
> * This one blows away the {{$scope.state}} in favor of a new one: 
> https://github.com/apache/mesos/blob/1.1.x/src/webui/master/static/js/controllers.js#L521
> The metrics callback should simply assign to a different variable.





[jira] [Created] (MESOS-6750) Metrics on the Agent view of the Mesos web UI flickers between empty and non-empty states

2016-12-07 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-6750:


 Summary: Metrics on the Agent view of the Mesos web UI flickers 
between empty and non-empty states
 Key: MESOS-6750
 URL: https://issues.apache.org/jira/browse/MESOS-6750
 Project: Mesos
  Issue Type: Bug
  Components: webui
Affects Versions: 1.1.0, 1.0.2
Reporter: Joseph Wu
Assignee: haosdent
Priority: Minor


When viewing a specific agent on the Mesos WebUI, the metrics panel on the left 
side of the UI will alternate between having values and being empty.

This is due to two different callbacks that run:
* This one sets the metrics into the {{$scope.state}} variable: 
https://github.com/apache/mesos/blob/1.1.x/src/webui/master/static/js/controllers.js#L564-L577
* This one blows away the {{$scope.state}} in favor of a new one: 
https://github.com/apache/mesos/blob/1.1.x/src/webui/master/static/js/controllers.js#L521

The metrics callback should simply assign to a different variable.





[jira] [Commented] (MESOS-6743) Docker executor hangs forever if `docker stop` fails.

2016-12-07 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15729768#comment-15729768
 ] 

Joseph Wu commented on MESOS-6743:
--

Just food for thought:

A timeout and retry for {{docker stop}} is definitely something we want to add 
(1).  

Suppose however, that {{docker stop}} is completely and forever broken.  In 
this case, it may be best for the executor to *kill the agent* (or somehow 
trigger the agent's death).  When the agent is restarted, it will then detect 
some orphan docker tasks (given {{--docker_kill_orphans}}), and attempt to kill 
them.  If that fails, the agent will fail to recover and start flapping 
(restart, detect orphans, fail to kill, suicide, ...).
^ This is preferable to me, compared to (2) and (3).

> Docker executor hangs forever if `docker stop` fails.
> -
>
> Key: MESOS-6743
> URL: https://issues.apache.org/jira/browse/MESOS-6743
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.0.1, 1.1.0
>Reporter: Alexander Rukletsov
>  Labels: mesosphere
>
> If {{docker stop}} finishes with an error status, the executor should catch 
> this and react instead of indefinitely waiting for {{reaped}} to return.
> An interesting question is _how_ to react. Here are possible solutions.
> 1. Retry {{docker stop}}. In this case it is unclear how many times to retry 
> and what to do if {{docker stop}} continues to fail.
> 2. Unmark task as {{killed}}. This will allow frameworks to retry the kill. 
> However, in this case it is unclear what status updates we should send: 
> {{TASK_KILLING}} for every kill retry? an extra update when we failed to kill 
> a task? or set a specific reason in {{TASK_KILLING}}?
> 3. Clean up and exit. In this case we should make sure the task container is 
> killed or notify the framework and the operator that the container may still 
> be running.





[jira] [Created] (MESOS-6749) Update master and agent endpoints to expose FrameworkInfo.roles.

2016-12-07 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-6749:
--

 Summary: Update master and agent endpoints to expose 
FrameworkInfo.roles.
 Key: MESOS-6749
 URL: https://issues.apache.org/jira/browse/MESOS-6749
 Project: Mesos
  Issue Type: Task
  Components: agent, master
Reporter: Benjamin Mahler
Assignee: Benjamin Bannier


With the addition of the FrameworkInfo.roles field, all of the endpoints that 
expose the framework information need to be updated to expose this additional 
field.

It should be the case that for the v1-style operator calls, the new field will 
be automatically visible thanks to the direct mapping from protobuf (we should 
verify this).

We can track the updates to metrics separately.





[jira] [Updated] (MESOS-6676) Always re-link with scheduler during re-registration

2016-12-07 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-6676:
---
Shepherd: Vinod Kone

> Always re-link with scheduler during re-registration
> 
>
> Key: MESOS-6676
> URL: https://issues.apache.org/jira/browse/MESOS-6676
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: mesosphere
>
> Scenario:
> # Framework registers with master using a non-zero {{failover_timeout}} and 
> is assigned a FrameworkID.
> # The master sees an {{ExitedEvent}} for the master->scheduler link. This 
> could happen due to some transient network error, e.g., 1-way partition. The 
> master sends a {{FrameworkErrorMessage}} to the framework. The master marks 
> the framework as disconnected, but keeps the {{Framework*}} for it around in 
> {{frameworks.registered}}.
> # The framework doesn't receive the {{FrameworkErrorMessage}} because it is 
> dropped by the network.
> # The scheduler might receive an {{ExitedEvent}} for the scheduler -> master 
> link, but it ignores this anyway (see MESOS-887).
> # The scheduler sees a new-master-detected event and re-registers with the 
> master. It does _not_ set the {{force}} flag. This means we follow [this 
> code 
> path|https://github.com/apache/mesos/blob/a6bab9015cd63121081495b8291635f386b95a92/src/master/master.cpp#L2771]
>  in the master, which does _not_ relink with the scheduler.
> The result is that scheduler re-registration succeeds, but the master -> 
> scheduler link is never re-established.





[jira] [Assigned] (MESOS-6676) Always re-link with scheduler during re-registration

2016-12-07 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway reassigned MESOS-6676:
--

Assignee: Neil Conway

> Always re-link with scheduler during re-registration
> 
>
> Key: MESOS-6676
> URL: https://issues.apache.org/jira/browse/MESOS-6676
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: mesosphere
>
> Scenario:
> # Framework registers with master using a non-zero {{failover_timeout}} and 
> is assigned a FrameworkID.
> # The master sees an {{ExitedEvent}} for the master->scheduler link. This 
> could happen due to some transient network error, e.g., 1-way partition. The 
> master sends a {{FrameworkErrorMessage}} to the framework. The master marks 
> the framework as disconnected, but keeps the {{Framework*}} for it around in 
> {{frameworks.registered}}.
> # The framework doesn't receive the {{FrameworkErrorMessage}} because it is 
> dropped by the network.
> # The scheduler might receive an {{ExitedEvent}} for the scheduler -> master 
> link, but it ignores this anyway (see MESOS-887).
> # The scheduler sees a new-master-detected event and re-registers with the 
> master. It does _not_ set the {{force}} flag. This means we follow [this 
> code 
> path|https://github.com/apache/mesos/blob/a6bab9015cd63121081495b8291635f386b95a92/src/master/master.cpp#L2771]
>  in the master, which does _not_ relink with the scheduler.
> The result is that scheduler re-registration succeeds, but the master -> 
> scheduler link is never re-established.





[jira] [Commented] (MESOS-6742) Adding support for s390x architecture

2016-12-07 Thread Abhishek Dasgupta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15729643#comment-15729643
 ] 

Abhishek Dasgupta commented on MESOS-6742:
--

To be added to the contributors list, you may raise a pull request on GitHub. 
Please also send a mail to the dev mailing list requesting contributor access, 
mentioning your Review Board ID and JIRA ID.

> Adding support for s390x architecture 
> --
>
> Key: MESOS-6742
> URL: https://issues.apache.org/jira/browse/MESOS-6742
> Project: Mesos
>  Issue Type: Bug
>Reporter: Ayanampudi Varsha
>
> There are 2 issues:
> 1. LdcacheTest.Parse test case fails on s390x machines.
> 2. Because of the value of the docker_registry flag in slave.cpp, amd64 images 
> get downloaded, which causes test cases to fail on s390x with "Exec format Error".





[jira] [Updated] (MESOS-6747) ContainerLogger runnable must not inherit the slave environment.

2016-12-07 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-6747:
--
Shepherd: Joseph Wu
Assignee: Till Toenshoff

> ContainerLogger runnable must not inherit the slave environment.
> 
>
> Key: MESOS-6747
> URL: https://issues.apache.org/jira/browse/MESOS-6747
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Till Toenshoff
>Assignee: Till Toenshoff
>Priority: Blocker
>  Labels: libprocess, logger
>
> The ContainerLogger module which forks a child process named 
> "mesos-logrotate-logger" does inherit the slave's environment. Specifically 
> things like {{LIBPROCESS_SSL_}} variables are not meant to be picked up 
> by that runnable and cause issues as soon as the owning user is not the same 
> as the one owning the agent process.
> So if the agent has an SSL key set up via {{LIBPROCESS_SSL_KEY_FILE}} and if 
> that key-file is readable by the agent user (root) only, then the 
> {{mesos-logrotate-logger}} will try to read that file as well even though it 
> is being run as nobody - that action will then fail the runnable and hence 
> fail the entire task.
> {noformat}
> Could not load key file '/my/funky/key/path/key.key' (OpenSSL error 
> #33558541): error:0200100D:system library:fopen:Permission denied
> {noformat}





[jira] [Updated] (MESOS-6679) Allow `network/cni` isolator to dynamically load CNI configuration

2016-12-07 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan updated MESOS-6679:
-
Summary: Allow `network/cni` isolator to dynamically load CNI configuration 
 (was: All `network/cni` isolator to dynamically load CNI configuration)

> Allow `network/cni` isolator to dynamically load CNI configuration
> --
>
> Key: MESOS-6679
> URL: https://issues.apache.org/jira/browse/MESOS-6679
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>  Labels: mesosphere
>
> Currently the `network/cni` isolator learns the CNI config at startup. In 
> case the CNI config changes after the agent has started, the agent needs to 
> be restarted in order to learn any modifications to the CNI config.
> We would like the `network/cni` isolator to be able to load CNI config on the 
> fly without a restart. To achieve this we plan to introduce a new endpoint on 
> the `network/cni` isolator that would allow the operator to explicitly ask 
> the `network/cni` isolator to reload the configuration.





[jira] [Assigned] (MESOS-6748) I/O switchboard should inherit agent environment variables.

2016-12-07 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-6748:
-

Assignee: Jie Yu

> I/O switchboard should inherit agent environment variables.
> ---
>
> Key: MESOS-6748
> URL: https://issues.apache.org/jira/browse/MESOS-6748
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Assignee: Jie Yu
>
> It is a libexec binary owned by Mesos, and the agent might have some 
> environment variables (e.g., LD_LIBRARY_PATH) that are needed by the I/O 
> switchboard server process.





[jira] [Created] (MESOS-6748) I/O switchboard should inherit agent environment variables.

2016-12-07 Thread Jie Yu (JIRA)
Jie Yu created MESOS-6748:
-

 Summary: I/O switchboard should inherit agent environment 
variables.
 Key: MESOS-6748
 URL: https://issues.apache.org/jira/browse/MESOS-6748
 Project: Mesos
  Issue Type: Bug
Reporter: Jie Yu


It is a libexec binary owned by Mesos, and the agent might have some 
environment variables (e.g., LD_LIBRARY_PATH) that are needed by the I/O 
switchboard server process.





[jira] [Commented] (MESOS-6747) ContainerLogger runnable must not inherit the slave environment.

2016-12-07 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15729409#comment-15729409
 ] 

Till Toenshoff commented on MESOS-6747:
---

Here is some information on how I got to the root of this problem.

The agent is set up to use SSL via the {{LIBPROCESS_SSL...}} variables.

The output within the agent log whenever a task is about to get run:
{noformat}
[...]
Dec 07 18:58:31 test-c0239131-89be-41fa-8193-17d9e67761d3.localdomain 
mesos-agent[2411]: I1207 18:58:31.491056  2455 
container_state_cache_impl.cpp:134] Writing container 
file[/var/run/mesos/isolators/com_mesosphere_MetricsIsolatorModule/containers/b8f97301-c477-49bc-87ed-1e7ea49366bf]
 with endpoint[198.51.100.1:37476]
Dec 07 18:58:31 test-c0239131-89be-41fa-8193-17d9e67761d3.localdomain 
mesos-agent[2411]: I1207 18:58:31.491147  2455 sync_util.hpp:136] Result for 
ticket 1297 complete, returning value.
Dec 07 18:58:31 test-c0239131-89be-41fa-8193-17d9e67761d3.localdomain 
mesos-agent[2411]: I1207 18:58:31.491202  2421 sync_util.hpp:83] Dispatch 
result obtained for ticket 1297 after waiting <=5s: register_and_update_cache
Dec 07 18:58:31 test-c0239131-89be-41fa-8193-17d9e67761d3.localdomain 
mesos-agent[2411]: I1207 18:58:31.534430  2424 systemd.cpp:96] Assigned child 
process '16285' to 'mesos_executors.slice'
Dec 07 18:58:31 test-c0239131-89be-41fa-8193-17d9e67761d3.localdomain 
mesos-agent[2411]: I1207 18:58:31.541784  2424 systemd.cpp:96] Assigned child 
process '16286' to 'mesos_executors.slice'
Dec 07 18:58:31 test-c0239131-89be-41fa-8193-17d9e67761d3.localdomain 
mesos-agent[2411]: I1207 18:58:31.551643  2425 linux_launcher.cpp:429] 
Launching container b8f97301-c477-49bc-87ed-1e7ea49366bf and cloning with 
namespaces CLONE_NEWNS
Dec 07 18:58:31 test-c0239131-89be-41fa-8193-17d9e67761d3.localdomain 
mesos-agent[2411]: I1207 18:58:31.555986  2425 systemd.cpp:96] Assigned child 
process '16288' to 'mesos_executors.slice'
Dec 07 18:58:31 test-c0239131-89be-41fa-8193-17d9e67761d3.localdomain 
mesos-agent[2411]: I1207 18:58:31.559031  2420 containerizer.cpp:1577] 
Checkpointing container's forked pid 16288 to 
'/var/lib/mesos/slave/meta/slaves/b79c72c2-5566-4a2f-86da-668611ee4e78-S0/frameworks/b79c72c2-5566-4a2f-86da-668611ee4e78-0001/executors/aoaoaoaoaoao.c3a39772-bca6-11e6-9988-70b3d581/runs/b8f97301-c477-49bc-87ed-1e7ea49366bf/pids/forked.pid'
Dec 07 18:58:31 test-c0239131-89be-41fa-8193-17d9e67761d3.localdomain 
mesos-agent[2411]: WARNING: Logging before InitGoogleLogging() is written to 
STDERR
Dec 07 18:58:31 test-c0239131-89be-41fa-8193-17d9e67761d3.localdomain 
mesos-agent[2411]: I1207 18:58:31.597594 16286 openssl.cpp:424] CA directory 
path unspecified! NOTE: Set CA directory path with 
LIBPROCESS_SSL_CA_DIR=
Dec 07 18:58:31 test-c0239131-89be-41fa-8193-17d9e67761d3.localdomain 
mesos-agent[2411]: I1207 18:58:31.597721 16286 openssl.cpp:429] Will not verify 
peer certificate!
Dec 07 18:58:31 test-c0239131-89be-41fa-8193-17d9e67761d3.localdomain 
mesos-agent[2411]: NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer 
certificate verification
Dec 07 18:58:31 test-c0239131-89be-41fa-8193-17d9e67761d3.localdomain 
mesos-agent[2411]: I1207 18:58:31.597728 16286 openssl.cpp:435] Will only 
verify peer certificate if presented!
Dec 07 18:58:31 test-c0239131-89be-41fa-8193-17d9e67761d3.localdomain 
mesos-agent[2411]: NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer 
certificate verification
Dec 07 18:58:31 test-c0239131-89be-41fa-8193-17d9e67761d3.localdomain 
mesos-agent[2411]: WARNING: Logging before InitGoogleLogging() is written to 
STDERR
Dec 07 18:58:31 test-c0239131-89be-41fa-8193-17d9e67761d3.localdomain 
mesos-agent[2411]: I1207 18:58:31.597862 16285 openssl.cpp:424] CA directory 
path unspecified! NOTE: Set CA directory path with 
LIBPROCESS_SSL_CA_DIR=
Dec 07 18:58:31 test-c0239131-89be-41fa-8193-17d9e67761d3.localdomain 
mesos-agent[2411]: I1207 18:58:31.597965 16285 openssl.cpp:429] Will not verify 
peer certificate!
Dec 07 18:58:31 test-c0239131-89be-41fa-8193-17d9e67761d3.localdomain 
mesos-agent[2411]: NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer 
certificate verification
Dec 07 18:58:31 test-c0239131-89be-41fa-8193-17d9e67761d3.localdomain 
mesos-agent[2411]: I1207 18:58:31.597975 16285 openssl.cpp:435] Will only 
verify peer certificate if presented!
Dec 07 18:58:31 test-c0239131-89be-41fa-8193-17d9e67761d3.localdomain 
mesos-agent[2411]: NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer 
certificate verification
Dec 07 18:58:31 test-c0239131-89be-41fa-8193-17d9e67761d3.localdomain 
mesos-agent[2411]: Could not load key file 
'/run/dcos/pki/tls/private/mesos-slave.key' (OpenSSL error #33558541): 
error:0200100D:system library:fopen:Permission denied
[...]
{noformat}

Get the agent pid:
{noformat}
$ ps aux |grep mesos-agent
root  2412  3.0  0.9 1169092 142968 ?  Sl   17:35   1:41 

[jira] [Updated] (MESOS-6747) ContainerLogger runnable must not inherit the slave environment.

2016-12-07 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-6747:
--
Affects Version/s: 1.2.0

> ContainerLogger runnable must not inherit the slave environment.
> 
>
> Key: MESOS-6747
> URL: https://issues.apache.org/jira/browse/MESOS-6747
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Till Toenshoff
>Priority: Blocker
>  Labels: libprocess, logger
>
> The ContainerLogger module which forks a child process named 
> "mesos-logrotate-logger" does inherit the slave's environment. Specifically 
> things like {{LIBPROCESS_SSL_}} variables are not meant to be picked up 
> by that runnable and cause issues as soon as the owning user is not the same 
> as the one owning the agent process.
> So if the agent has an SSL key set up via {{LIBPROCESS_SSL_KEY_FILE}} and if 
> that key-file is readable by the agent user (root) only, then the 
> {{mesos-logrotate-logger}} will try to read that file as well even though it 
> is being run as nobody - that action will then fail the runnable and hence 
> fail the entire task.
> {noformat}
> Could not load key file '/my/funky/key/path/key.key' (OpenSSL error 
> #33558541): error:0200100D:system library:fopen:Permission denied
> {noformat}





[jira] [Updated] (MESOS-6747) ContainerLogger runnable must not inherit the slave environment.

2016-12-07 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-6747:
--
Labels: libprocess logger  (was: )

> ContainerLogger runnable must not inherit the slave environment.
> 
>
> Key: MESOS-6747
> URL: https://issues.apache.org/jira/browse/MESOS-6747
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Till Toenshoff
>Priority: Blocker
>  Labels: libprocess, logger
>
> The ContainerLogger module which forks a child process named 
> "mesos-logrotate-logger" does inherit the slave's environment. Specifically 
> things like {{LIBPROCESS_SSL_}} variables are not meant to be picked up 
> by that runnable and cause issues as soon as the owning user is not the same 
> as the one owning the agent process.
> So if the agent has an SSL key set up via {{LIBPROCESS_SSL_KEY_FILE}} and if 
> that key-file is readable by the agent user (root) only, then the 
> {{mesos-logrotate-logger}} will try to read that file as well even though it 
> is being run as nobody - that action will then fail the runnable and hence 
> fail the entire task.
> {noformat}
> Could not load key file '/my/funky/key/path/key.key' (OpenSSL error 
> #33558541): error:0200100D:system library:fopen:Permission denied
> {noformat}





[jira] [Commented] (MESOS-6747) ContainerLogger runnable must not inherit the slave environment.

2016-12-07 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15729392#comment-15729392
 ] 

Till Toenshoff commented on MESOS-6747:
---

This did not pop up earlier because originally the mesos-logrotate-logger was 
running in the agent context and as the agent user, hence it had no issues 
accessing the key-file. Now that 
https://issues.apache.org/jira/browse/MESOS-5856 has landed, the logger is 
running as a different user, causing this problem to surface.

> ContainerLogger runnable must not inherit the slave environment.
> 
>
> Key: MESOS-6747
> URL: https://issues.apache.org/jira/browse/MESOS-6747
> Project: Mesos
>  Issue Type: Bug
>Reporter: Till Toenshoff
>Priority: Blocker
>
> The ContainerLogger module which forks a child process named 
> "mesos-logrotate-logger" does inherit the slave's environment. Specifically 
> things like {{LIBPROCESS_SSL_}} variables are not meant to be picked up 
> by that runnable and cause issues as soon as the owning user is not the same 
> as the one owning the agent process.
> So if the agent has an SSL key set up via {{LIBPROCESS_SSL_KEY_FILE}} and if 
> that key-file is readable by the agent user (root) only, then the 
> {{mesos-logrotate-logger}} will try to read that file as well even though it 
> is being run as nobody - that action will then fail the runnable and hence 
> fail the entire task.
> {noformat}
> Could not load key file '/my/funky/key/path/key.key' (OpenSSL error 
> #33558541): error:0200100D:system library:fopen:Permission denied
> {noformat}





[jira] [Created] (MESOS-6747) ContainerLogger runnable must not inherit the slave environment.

2016-12-07 Thread Till Toenshoff (JIRA)
Till Toenshoff created MESOS-6747:
-

 Summary: ContainerLogger runnable must not inherit the slave 
environment.
 Key: MESOS-6747
 URL: https://issues.apache.org/jira/browse/MESOS-6747
 Project: Mesos
  Issue Type: Bug
Reporter: Till Toenshoff
Priority: Blocker


The ContainerLogger module which forks a child process named 
"mesos-logrotate-logger" does inherit the slave's environment. Specifically 
things like {{LIBPROCESS_SSL_}} variables are not meant to be picked up by 
that runnable and cause issues as soon as the owning user is not the same as 
the one owning the agent process.
So if the agent has an SSL key set up via {{LIBPROCESS_SSL_KEY_FILE}} and if 
that key-file is readable by the agent user (root) only, then the 
{{mesos-logrotate-logger}} will try to read that file as well even though it is 
being run as nobody - that action will then fail the runnable and hence fail 
the entire task.

{noformat}
Could not load key file '/my/funky/key/path/key.key' (OpenSSL error #33558541): 
error:0200100D:system library:fopen:Permission denied
{noformat}





[jira] [Created] (MESOS-6746) IOSwitchboard doesn't properly flush data on ATTACH_CONTAINER_OUTPUT

2016-12-07 Thread Kevin Klues (JIRA)
Kevin Klues created MESOS-6746:
--

 Summary: IOSwitchboard doesn't properly flush data on 
ATTACH_CONTAINER_OUTPUT
 Key: MESOS-6746
 URL: https://issues.apache.org/jira/browse/MESOS-6746
 Project: Mesos
  Issue Type: Bug
Reporter: Kevin Klues
Assignee: Anand Mazumdar


Currently we are doing a close on the write end of all connection pipes when we 
exit the switchboard, but we don't wait until the read is flushed before 
exiting. This can cause some data to get dropped since the process may exit 
before the reader is flushed.  The current code is:
{noformat}
void IOSwitchboardServerProcess::finalize()   
{ 
  foreach (HttpConnection& connection, outputConnections) {   
connection.close();  
  }   
  
  if (failure.isSome()) {
promise.fail(failure->message);   
  } else {
promise.set(Nothing());   
  }   
} 
{noformat}

We should change it to:
{noformat}
void IOSwitchboardServerProcess::finalize()   
{ 
  foreach (HttpConnection& connection, outputConnections) {   
connection.close();
connection.closed().await();  
  }   
  
  if (failure.isSome()) {
promise.fail(failure->message);   
  } else {
promise.set(Nothing());   
  }   
} 
{noformat}





[jira] [Commented] (MESOS-6726) IOSwitchboardServerFlags adds flags for non-optional fields w/o providing a default value

2016-12-07 Thread Kevin Klues (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15729252#comment-15729252
 ] 

Kevin Klues commented on MESOS-6726:


How do I test that I've fixed it?

> IOSwitchboardServerFlags adds flags for non-optional fields w/o providing a 
> default value
> -
>
> Key: MESOS-6726
> URL: https://issues.apache.org/jira/browse/MESOS-6726
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Bannier
>Assignee: Kevin Klues
>  Labels: tech-debt
>
> The class {{IOSwitchboardFlags}} contains a number of members of 
> non-{{Option}}, fundamental type (i.e., types which do not have 
> constructors). As customary for a {{Flags}} class, these fields are not 
> initialized since usually the initialization is done by the calling the 
> correct overload of {{FlagsBase::add}} taking a default value.
> The class {{IOSwitchbardFlags}} calls an {{add}} overload acting on a 
> non-{{Option}} member which does not take the default value. This can lead to 
> the members containing garbage values.





[jira] [Assigned] (MESOS-6726) IOSwitchboardServerFlags adds flags for non-optional fields w/o providing a default value

2016-12-07 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues reassigned MESOS-6726:
--

Assignee: Kevin Klues

> IOSwitchboardServerFlags adds flags for non-optional fields w/o providing a 
> default value
> -
>
> Key: MESOS-6726
> URL: https://issues.apache.org/jira/browse/MESOS-6726
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Bannier
>Assignee: Kevin Klues
>  Labels: tech-debt
>
> The class {{IOSwitchboardFlags}} contains a number of members of 
> non-{{Option}}, fundamental type (i.e., types which do not have 
> constructors). As is customary for a {{Flags}} class, these fields are not 
> initialized, since the initialization is usually done by calling the 
> correct overload of {{FlagsBase::add}} that takes a default value.
> The class {{IOSwitchboardFlags}} calls an {{add}} overload acting on a 
> non-{{Option}} member which does not take a default value. This can lead to 
> the members containing garbage values.





[jira] [Commented] (MESOS-6744) DefaultExecutorTest.KillTaskGroupOnTaskFailure is flaky

2016-12-07 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15729231#comment-15729231
 ] 

Anand Mazumdar commented on MESOS-6744:
---

From the logs, this looks like a separate issue from the one we fixed in MESOS-6576 
(around status update reordering).

> DefaultExecutorTest.KillTaskGroupOnTaskFailure is flaky
> ---
>
> Key: MESOS-6744
> URL: https://issues.apache.org/jira/browse/MESOS-6744
> Project: Mesos
>  Issue Type: Bug
> Environment: Recent Arch Linux VM, amd64.
>Reporter: Neil Conway
>  Labels: mesosphere
>
> This repros consistently for me (~10 test iterations or fewer). Test log:
> {noformat}
> [ RUN  ] DefaultExecutorTest.KillTaskGroupOnTaskFailure
> I1208 03:26:47.461477 28632 cluster.cpp:160] Creating default 'local' 
> authorizer
> I1208 03:26:47.462673 28632 replica.cpp:776] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1208 03:26:47.463248 28650 recover.cpp:451] Starting replica recovery
> I1208 03:26:47.463537 28650 recover.cpp:477] Replica is in EMPTY status
> I1208 03:26:47.476333 28651 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from __req_res__(64)@10.0.2.15:46643
> I1208 03:26:47.476618 28650 recover.cpp:197] Received a recover response from 
> a replica in EMPTY status
> I1208 03:26:47.477242 28649 recover.cpp:568] Updating replica status to 
> STARTING
> I1208 03:26:47.477496 28649 replica.cpp:320] Persisted replica status to 
> STARTING
> I1208 03:26:47.477607 28649 recover.cpp:477] Replica is in STARTING status
> I1208 03:26:47.478910 28653 replica.cpp:673] Replica in STARTING status 
> received a broadcasted recover request from __req_res__(65)@10.0.2.15:46643
> I1208 03:26:47.479385 28651 recover.cpp:197] Received a recover response from 
> a replica in STARTING status
> I1208 03:26:47.479717 28647 recover.cpp:568] Updating replica status to VOTING
> I1208 03:26:47.479996 28648 replica.cpp:320] Persisted replica status to 
> VOTING
> I1208 03:26:47.480077 28648 recover.cpp:582] Successfully joined the Paxos 
> group
> I1208 03:26:47.763380 28651 master.cpp:380] Master 
> 0bcb0250-4cf5-4209-92fe-ce260518b50f (archlinux.vagrant.vm) started on 
> 10.0.2.15:46643
> I1208 03:26:47.763463 28651 master.cpp:382] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/7lpy50/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
> --registry_max_agent_count="102400" --registry_store_timeout="100secs" 
> --registry_strict="false" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/7lpy50/master" --zk_session_timeout="10secs"
> I1208 03:26:47.764010 28651 master.cpp:432] Master only allowing 
> authenticated frameworks to register
> I1208 03:26:47.764070 28651 master.cpp:446] Master only allowing 
> authenticated agents to register
> I1208 03:26:47.764076 28651 master.cpp:459] Master only allowing 
> authenticated HTTP frameworks to register
> I1208 03:26:47.764081 28651 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/7lpy50/credentials'
> I1208 03:26:47.764482 28651 master.cpp:504] Using default 'crammd5' 
> authenticator
> I1208 03:26:47.764659 28651 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1208 03:26:47.764981 28651 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1208 03:26:47.765136 28651 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1208 03:26:47.765231 28651 master.cpp:584] Authorization enabled
> I1208 03:26:47.768061 28651 master.cpp:2043] Elected as the leading master!
> I1208 03:26:47.768097 28651 master.cpp:1566] Recovering from registrar
> I1208 03:26:47.768766 28648 log.cpp:553] Attempting to start the writer
> I1208 03:26:47.769899 28653 replica.cpp:493] Replica 

[jira] [Commented] (MESOS-6614) ExamplesTest.DiskFullFramework hangs on FreeBSD

2016-12-07 Thread David Forsythe (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15729202#comment-15729202
 ] 

David Forsythe commented on MESOS-6614:
---

This actually only happens inside a jail, since the test wants 127.0.0.1 as 
the master address.

> ExamplesTest.DiskFullFramework hangs on FreeBSD
> ---
>
> Key: MESOS-6614
> URL: https://issues.apache.org/jira/browse/MESOS-6614
> Project: Mesos
>  Issue Type: Bug
>Reporter: David Forsythe
>Assignee: David Forsythe
>
> ExamplesTest.DiskFullFramework hangs on gmake check on FreeBSD.





[jira] [Commented] (MESOS-6745) MesosContainerizer/DefaultExecutorTest.KillTask/0 is flaky

2016-12-07 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15729193#comment-15729193
 ] 

Neil Conway commented on MESOS-6745:


CC [~anandmazumdar] [~bbannier]

> MesosContainerizer/DefaultExecutorTest.KillTask/0 is flaky
> --
>
> Key: MESOS-6745
> URL: https://issues.apache.org/jira/browse/MESOS-6745
> Project: Mesos
>  Issue Type: Bug
> Environment: Recent Arch Linux VM
>Reporter: Neil Conway
>  Labels: mesosphere
>
> This repros consistently for me (< 20 test iterations), using {{master}} as 
> of {{ab79d58c9df0ffb8ad35f6662541e7a5c3ea4a80}}. Test log:
> {noformat}
> [--] 1 test from MesosContainerizer/DefaultExecutorTest
> [ RUN  ] MesosContainerizer/DefaultExecutorTest.KillTask/0
> I1208 03:32:34.943745 29285 cluster.cpp:160] Creating default 'local' 
> authorizer
> I1208 03:32:34.944695 29285 replica.cpp:776] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1208 03:32:34.945287 29306 recover.cpp:451] Starting replica recovery
> I1208 03:32:34.945431 29306 recover.cpp:477] Replica is in EMPTY status
> I1208 03:32:34.946542 29300 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from __req_res__(127)@10.0.2.15:36807
> I1208 03:32:34.946768 29301 recover.cpp:197] Received a recover response from 
> a replica in EMPTY status
> I1208 03:32:34.947377 29299 recover.cpp:568] Updating replica status to 
> STARTING
> I1208 03:32:34.947746 29306 replica.cpp:320] Persisted replica status to 
> STARTING
> I1208 03:32:34.947887 29306 recover.cpp:477] Replica is in STARTING status
> I1208 03:32:34.948559 29306 replica.cpp:673] Replica in STARTING status 
> received a broadcasted recover request from __req_res__(128)@10.0.2.15:36807
> I1208 03:32:34.948771 29299 recover.cpp:197] Received a recover response from 
> a replica in STARTING status
> I1208 03:32:34.949097 29302 recover.cpp:568] Updating replica status to VOTING
> I1208 03:32:34.949385 29306 replica.cpp:320] Persisted replica status to 
> VOTING
> I1208 03:32:34.949467 29306 recover.cpp:582] Successfully joined the Paxos 
> group
> I1208 03:32:34.971436 29301 master.cpp:380] Master 
> 67de7bda-9b5b-4fe9-aede-390ec9ca7290 (archlinux.vagrant.vm) started on 
> 10.0.2.15:36807
> I1208 03:32:34.971519 29301 master.cpp:382] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/8oMk6W/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
> --registry_max_agent_count="102400" --registry_store_timeout="100secs" 
> --registry_strict="false" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/8oMk6W/master" --zk_session_timeout="10secs"
> I1208 03:32:34.971824 29301 master.cpp:432] Master only allowing 
> authenticated frameworks to register
> I1208 03:32:34.971832 29301 master.cpp:446] Master only allowing 
> authenticated agents to register
> I1208 03:32:34.971837 29301 master.cpp:459] Master only allowing 
> authenticated HTTP frameworks to register
> I1208 03:32:34.971842 29301 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/8oMk6W/credentials'
> I1208 03:32:34.972051 29301 master.cpp:504] Using default 'crammd5' 
> authenticator
> I1208 03:32:34.972198 29301 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1208 03:32:34.972327 29301 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1208 03:32:34.972436 29301 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1208 03:32:34.972561 29301 master.cpp:584] Authorization enabled
> I1208 03:32:34.974555 29300 master.cpp:2043] Elected as the leading master!
> I1208 03:32:34.974586 29300 master.cpp:1566] Recovering from registrar
> I1208 03:32:34.975244 29306 log.cpp:553] Attempting to start the writer
> I1208 

[jira] [Created] (MESOS-6745) MesosContainerizer/DefaultExecutorTest.KillTask/0 is flaky

2016-12-07 Thread Neil Conway (JIRA)
Neil Conway created MESOS-6745:
--

 Summary: MesosContainerizer/DefaultExecutorTest.KillTask/0 is flaky
 Key: MESOS-6745
 URL: https://issues.apache.org/jira/browse/MESOS-6745
 Project: Mesos
  Issue Type: Bug
 Environment: Recent Arch Linux VM
Reporter: Neil Conway


This repros consistently for me (< 20 test iterations), using {{master}} as of 
{{ab79d58c9df0ffb8ad35f6662541e7a5c3ea4a80}}. Test log:

{noformat}
[--] 1 test from MesosContainerizer/DefaultExecutorTest
[ RUN  ] MesosContainerizer/DefaultExecutorTest.KillTask/0
I1208 03:32:34.943745 29285 cluster.cpp:160] Creating default 'local' authorizer
I1208 03:32:34.944695 29285 replica.cpp:776] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I1208 03:32:34.945287 29306 recover.cpp:451] Starting replica recovery
I1208 03:32:34.945431 29306 recover.cpp:477] Replica is in EMPTY status
I1208 03:32:34.946542 29300 replica.cpp:673] Replica in EMPTY status received a 
broadcasted recover request from __req_res__(127)@10.0.2.15:36807
I1208 03:32:34.946768 29301 recover.cpp:197] Received a recover response from a 
replica in EMPTY status
I1208 03:32:34.947377 29299 recover.cpp:568] Updating replica status to STARTING
I1208 03:32:34.947746 29306 replica.cpp:320] Persisted replica status to 
STARTING
I1208 03:32:34.947887 29306 recover.cpp:477] Replica is in STARTING status
I1208 03:32:34.948559 29306 replica.cpp:673] Replica in STARTING status 
received a broadcasted recover request from __req_res__(128)@10.0.2.15:36807
I1208 03:32:34.948771 29299 recover.cpp:197] Received a recover response from a 
replica in STARTING status
I1208 03:32:34.949097 29302 recover.cpp:568] Updating replica status to VOTING
I1208 03:32:34.949385 29306 replica.cpp:320] Persisted replica status to VOTING
I1208 03:32:34.949467 29306 recover.cpp:582] Successfully joined the Paxos group
I1208 03:32:34.971436 29301 master.cpp:380] Master 
67de7bda-9b5b-4fe9-aede-390ec9ca7290 (archlinux.vagrant.vm) started on 
10.0.2.15:36807
I1208 03:32:34.971519 29301 master.cpp:382] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/8oMk6W/credentials" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--quiet="false" --recovery_agent_removal_limit="100%" 
--registry="replicated_log" --registry_fetch_timeout="1mins" 
--registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
--registry_max_agent_count="102400" --registry_store_timeout="100secs" 
--registry_strict="false" --root_submissions="true" --user_sorter="drf" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/8oMk6W/master" --zk_session_timeout="10secs"
I1208 03:32:34.971824 29301 master.cpp:432] Master only allowing authenticated 
frameworks to register
I1208 03:32:34.971832 29301 master.cpp:446] Master only allowing authenticated 
agents to register
I1208 03:32:34.971837 29301 master.cpp:459] Master only allowing authenticated 
HTTP frameworks to register
I1208 03:32:34.971842 29301 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/8oMk6W/credentials'
I1208 03:32:34.972051 29301 master.cpp:504] Using default 'crammd5' 
authenticator
I1208 03:32:34.972198 29301 http.cpp:922] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I1208 03:32:34.972327 29301 http.cpp:922] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I1208 03:32:34.972436 29301 http.cpp:922] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I1208 03:32:34.972561 29301 master.cpp:584] Authorization enabled
I1208 03:32:34.974555 29300 master.cpp:2043] Elected as the leading master!
I1208 03:32:34.974586 29300 master.cpp:1566] Recovering from registrar
I1208 03:32:34.975244 29306 log.cpp:553] Attempting to start the writer
I1208 03:32:34.976706 29304 replica.cpp:493] Replica received implicit promise 
request from __req_res__(129)@10.0.2.15:36807 with proposal 1
I1208 03:32:34.976793 29304 replica.cpp:342] Persisted promised to 1
I1208 03:32:34.977449 29300 coordinator.cpp:238] Coordinator attempting to fill 
missing positions
I1208 03:32:34.978907 29303 replica.cpp:388] Replica received explicit promise 
request from __req_res__(130)@10.0.2.15:36807 for 

[jira] [Commented] (MESOS-6744) DefaultExecutorTest.KillTaskGroupOnTaskFailure is flaky

2016-12-07 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15729180#comment-15729180
 ] 

Neil Conway commented on MESOS-6744:


CC [~anandmazumdar] [~bbannier]

> DefaultExecutorTest.KillTaskGroupOnTaskFailure is flaky
> ---
>
> Key: MESOS-6744
> URL: https://issues.apache.org/jira/browse/MESOS-6744
> Project: Mesos
>  Issue Type: Bug
> Environment: Recent Arch Linux VM, amd64.
>Reporter: Neil Conway
>  Labels: mesosphere
>
> This repros consistently for me (~10 test iterations or fewer). Test log:
> {noformat}
> [ RUN  ] DefaultExecutorTest.KillTaskGroupOnTaskFailure
> I1208 03:26:47.461477 28632 cluster.cpp:160] Creating default 'local' 
> authorizer
> I1208 03:26:47.462673 28632 replica.cpp:776] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1208 03:26:47.463248 28650 recover.cpp:451] Starting replica recovery
> I1208 03:26:47.463537 28650 recover.cpp:477] Replica is in EMPTY status
> I1208 03:26:47.476333 28651 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from __req_res__(64)@10.0.2.15:46643
> I1208 03:26:47.476618 28650 recover.cpp:197] Received a recover response from 
> a replica in EMPTY status
> I1208 03:26:47.477242 28649 recover.cpp:568] Updating replica status to 
> STARTING
> I1208 03:26:47.477496 28649 replica.cpp:320] Persisted replica status to 
> STARTING
> I1208 03:26:47.477607 28649 recover.cpp:477] Replica is in STARTING status
> I1208 03:26:47.478910 28653 replica.cpp:673] Replica in STARTING status 
> received a broadcasted recover request from __req_res__(65)@10.0.2.15:46643
> I1208 03:26:47.479385 28651 recover.cpp:197] Received a recover response from 
> a replica in STARTING status
> I1208 03:26:47.479717 28647 recover.cpp:568] Updating replica status to VOTING
> I1208 03:26:47.479996 28648 replica.cpp:320] Persisted replica status to 
> VOTING
> I1208 03:26:47.480077 28648 recover.cpp:582] Successfully joined the Paxos 
> group
> I1208 03:26:47.763380 28651 master.cpp:380] Master 
> 0bcb0250-4cf5-4209-92fe-ce260518b50f (archlinux.vagrant.vm) started on 
> 10.0.2.15:46643
> I1208 03:26:47.763463 28651 master.cpp:382] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/7lpy50/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
> --registry_max_agent_count="102400" --registry_store_timeout="100secs" 
> --registry_strict="false" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/7lpy50/master" --zk_session_timeout="10secs"
> I1208 03:26:47.764010 28651 master.cpp:432] Master only allowing 
> authenticated frameworks to register
> I1208 03:26:47.764070 28651 master.cpp:446] Master only allowing 
> authenticated agents to register
> I1208 03:26:47.764076 28651 master.cpp:459] Master only allowing 
> authenticated HTTP frameworks to register
> I1208 03:26:47.764081 28651 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/7lpy50/credentials'
> I1208 03:26:47.764482 28651 master.cpp:504] Using default 'crammd5' 
> authenticator
> I1208 03:26:47.764659 28651 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1208 03:26:47.764981 28651 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1208 03:26:47.765136 28651 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1208 03:26:47.765231 28651 master.cpp:584] Authorization enabled
> I1208 03:26:47.768061 28651 master.cpp:2043] Elected as the leading master!
> I1208 03:26:47.768097 28651 master.cpp:1566] Recovering from registrar
> I1208 03:26:47.768766 28648 log.cpp:553] Attempting to start the writer
> I1208 03:26:47.769899 28653 replica.cpp:493] Replica received implicit 
> promise request from __req_res__(66)@10.0.2.15:46643 with proposal 1
> 

[jira] [Created] (MESOS-6744) DefaultExecutorTest.KillTaskGroupOnTaskFailure is flaky

2016-12-07 Thread Neil Conway (JIRA)
Neil Conway created MESOS-6744:
--

 Summary: DefaultExecutorTest.KillTaskGroupOnTaskFailure is flaky
 Key: MESOS-6744
 URL: https://issues.apache.org/jira/browse/MESOS-6744
 Project: Mesos
  Issue Type: Bug
 Environment: Recent Arch Linux VM, amd64.
Reporter: Neil Conway


This repros consistently for me (~10 test iterations or fewer). Test log:

{noformat}
[ RUN  ] DefaultExecutorTest.KillTaskGroupOnTaskFailure
I1208 03:26:47.461477 28632 cluster.cpp:160] Creating default 'local' authorizer
I1208 03:26:47.462673 28632 replica.cpp:776] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I1208 03:26:47.463248 28650 recover.cpp:451] Starting replica recovery
I1208 03:26:47.463537 28650 recover.cpp:477] Replica is in EMPTY status
I1208 03:26:47.476333 28651 replica.cpp:673] Replica in EMPTY status received a 
broadcasted recover request from __req_res__(64)@10.0.2.15:46643
I1208 03:26:47.476618 28650 recover.cpp:197] Received a recover response from a 
replica in EMPTY status
I1208 03:26:47.477242 28649 recover.cpp:568] Updating replica status to STARTING
I1208 03:26:47.477496 28649 replica.cpp:320] Persisted replica status to 
STARTING
I1208 03:26:47.477607 28649 recover.cpp:477] Replica is in STARTING status
I1208 03:26:47.478910 28653 replica.cpp:673] Replica in STARTING status 
received a broadcasted recover request from __req_res__(65)@10.0.2.15:46643
I1208 03:26:47.479385 28651 recover.cpp:197] Received a recover response from a 
replica in STARTING status
I1208 03:26:47.479717 28647 recover.cpp:568] Updating replica status to VOTING
I1208 03:26:47.479996 28648 replica.cpp:320] Persisted replica status to VOTING
I1208 03:26:47.480077 28648 recover.cpp:582] Successfully joined the Paxos group
I1208 03:26:47.763380 28651 master.cpp:380] Master 
0bcb0250-4cf5-4209-92fe-ce260518b50f (archlinux.vagrant.vm) started on 
10.0.2.15:46643
I1208 03:26:47.763463 28651 master.cpp:382] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/7lpy50/credentials" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--quiet="false" --recovery_agent_removal_limit="100%" 
--registry="replicated_log" --registry_fetch_timeout="1mins" 
--registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
--registry_max_agent_count="102400" --registry_store_timeout="100secs" 
--registry_strict="false" --root_submissions="true" --user_sorter="drf" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/7lpy50/master" --zk_session_timeout="10secs"
I1208 03:26:47.764010 28651 master.cpp:432] Master only allowing authenticated 
frameworks to register
I1208 03:26:47.764070 28651 master.cpp:446] Master only allowing authenticated 
agents to register
I1208 03:26:47.764076 28651 master.cpp:459] Master only allowing authenticated 
HTTP frameworks to register
I1208 03:26:47.764081 28651 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/7lpy50/credentials'
I1208 03:26:47.764482 28651 master.cpp:504] Using default 'crammd5' 
authenticator
I1208 03:26:47.764659 28651 http.cpp:922] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I1208 03:26:47.764981 28651 http.cpp:922] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I1208 03:26:47.765136 28651 http.cpp:922] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I1208 03:26:47.765231 28651 master.cpp:584] Authorization enabled
I1208 03:26:47.768061 28651 master.cpp:2043] Elected as the leading master!
I1208 03:26:47.768097 28651 master.cpp:1566] Recovering from registrar
I1208 03:26:47.768766 28648 log.cpp:553] Attempting to start the writer
I1208 03:26:47.769899 28653 replica.cpp:493] Replica received implicit promise 
request from __req_res__(66)@10.0.2.15:46643 with proposal 1
I1208 03:26:47.769984 28653 replica.cpp:342] Persisted promised to 1
I1208 03:26:47.770534 28652 coordinator.cpp:238] Coordinator attempting to fill 
missing positions
I1208 03:26:47.771479 28652 replica.cpp:388] Replica received explicit promise 
request from __req_res__(67)@10.0.2.15:46643 for position 0 with proposal 2
I1208 03:26:47.772897 28650 replica.cpp:537] Replica received write request for 
position 0 from 

[jira] [Commented] (MESOS-6726) IOSwitchboardServerFlags adds flags for non-optional fields w/o providing a default value

2016-12-07 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15729162#comment-15729162
 ] 

Benjamin Bannier commented on MESOS-6726:
-

[~klueska]: Could you find some time to address these?

> IOSwitchboardServerFlags adds flags for non-optional fields w/o providing a 
> default value
> -
>
> Key: MESOS-6726
> URL: https://issues.apache.org/jira/browse/MESOS-6726
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Bannier
>  Labels: tech-debt
>
> The class {{IOSwitchboardFlags}} contains a number of members of 
> non-{{Option}}, fundamental type (i.e., types which do not have 
> constructors). As is customary for a {{Flags}} class, these fields are not 
> initialized, since the initialization is usually done by calling the 
> correct overload of {{FlagsBase::add}} that takes a default value.
> The class {{IOSwitchboardFlags}} calls an {{add}} overload acting on a 
> non-{{Option}} member which does not take a default value. This can lead to 
> the members containing garbage values.





[jira] [Created] (MESOS-6743) Docker executor hangs forever if `docker stop` fails.

2016-12-07 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-6743:
--

 Summary: Docker executor hangs forever if `docker stop` fails.
 Key: MESOS-6743
 URL: https://issues.apache.org/jira/browse/MESOS-6743
 Project: Mesos
  Issue Type: Bug
  Components: docker
Affects Versions: 1.0.1, 1.1.0
Reporter: Alexander Rukletsov


If {{docker stop}} finishes with an error status, the executor should catch 
this and react instead of indefinitely waiting for {{reaped}} to return.

An interesting question is _how_ to react. Here are possible solutions.

1. Retry {{docker stop}}. In this case it is unclear how many times to retry 
and what to do if {{docker stop}} continues to fail.

2. Unmark the task as {{killed}}. This will allow frameworks to retry the kill. 
However, in this case it is unclear what status updates we should send: 
{{TASK_KILLING}} for every kill retry? An extra update when we fail to kill a 
task? Or a specific reason set in {{TASK_KILLING}}?

3. Clean up and exit. In this case we should make sure the task container is 
killed or notify the framework and the operator that the container may still be 
running.
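The options above are not mutually exclusive. A minimal C++ sketch of one way to combine them, under stated assumptions: retry {{docker stop}} a bounded number of times (option 1) and, if every attempt fails, report that so the caller can clean up and notify the framework and operator (option 3) instead of waiting on {{reaped}} forever. {{stopWithRetries}} and {{StopOutcome}} are hypothetical names, and the {{dockerStop}} callback stands in for however the executor actually runs {{docker stop}} and obtains its exit status.

```cpp
#include <cassert>
#include <functional>

// Hypothetical result of trying to stop a container.
enum class StopOutcome { Stopped, GaveUp };

// Hypothetical helper: runs `dockerStop` (which returns the exit status of
// `docker stop`) up to `maxAttempts` times and stops at the first success.
// `attemptsUsed` receives the number of attempts actually made.
// A real executor would do this asynchronously, with a backoff between
// attempts, rather than in a tight synchronous loop.
StopOutcome stopWithRetries(const std::function<int()>& dockerStop,
                            int maxAttempts,
                            int* attemptsUsed) {
  *attemptsUsed = 0;
  for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
    *attemptsUsed = attempt;
    if (dockerStop() == 0) {
      // Only after a successful stop is it reasonable to wait for reaping.
      return StopOutcome::Stopped;
    }
  }
  // Every attempt failed: the caller should clean up and tell the
  // framework/operator that the container may still be running.
  return StopOutcome::GaveUp;
}
```

Bounding the retries keeps the executor from hanging, while the explicit {{GaveUp}} result leaves the choice of status update (the open question in option 2) to the caller.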





[jira] [Commented] (MESOS-6742) Adding support for s390x architecture

2016-12-07 Thread Ayanampudi Varsha (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15728914#comment-15728914
 ] 

Ayanampudi Varsha commented on MESOS-6742:
--

Hi,
I would like to submit a code change to add s390x support to Mesos. Could I be 
added to the dev contributors list so that I have contributor access?

Thanks,

> Adding support for s390x architecture 
> --
>
> Key: MESOS-6742
> URL: https://issues.apache.org/jira/browse/MESOS-6742
> Project: Mesos
>  Issue Type: Bug
>Reporter: Ayanampudi Varsha
>
> There are 2 issues:
> 1. LdcacheTest.Parse test case fails on s390x machines.
> 2. Because of the value of the docker_registry flag in slave.cpp, amd64 images 
> get downloaded, which causes test cases to fail on s390x with "Exec format Error".





[jira] [Commented] (MESOS-4812) Mesos fails to escape command health checks

2016-12-07 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15728864#comment-15728864
 ] 

haosdent commented on MESOS-4812:
-

Thanks to [~lloesche]'s help, I could reproduce this with the following application definition:
{code}
{
  "id": "/test",
  "cmd": null,
  "cpus": 1,
  "mem": 128,
  "disk": 0,
  "instances": 1,
  "executor": null,
  "fetch": null,
  "constraints": null,
  "acceptedResourceRoles": null,
  "user": null,
  "container": {
"docker": {
  "image": "nginx",
  "forcePullImage": false,
  "privileged": false,
  "portMappings": [
{
  "containerPort": 80,
  "protocol": "tcp"
}
  ],
  "network": "BRIDGE"
}
  },
  "labels": null,
  "healthChecks": [
{
  "protocol": "COMMAND",
  "command": {
"value": "bash -c \" commandArguments;
  commandArguments.push_back(docker->getPath());
  commandArguments.push_back("exec");
  commandArguments.push_back(containerName);

  if (command.shell()) {
commandArguments.push_back("sh");
commandArguments.push_back("-c");
commandArguments.push_back("\"");
commandArguments.push_back(command.value());
commandArguments.push_back("\"");
  } else {
commandArguments.push_back(command.value());

foreach (const string& argument, command.arguments()) {
  commandArguments.push_back(argument);
}
  }

  healthCheck.mutable_command()->set_shell(true);  // <-- Causes the problem.
  healthCheck.mutable_command()->clear_arguments();
  healthCheck.mutable_command()->set_value(
  strings::join(" ", commandArguments));  // <-- Causes the problem.
{code}

Then it would generate the health check command 
{code}
sh -c 'docker exec 
mesos-ce13aa71-ebba-4361-b6dd-8d4ce57ea4ab-S9.566f6c77-a6c9-46e0-bc40-5fe95a1aa9ae
 sh -c " bash -c " Mesos fails to escape command health checks
> ---
>
> Key: MESOS-4812
> URL: https://issues.apache.org/jira/browse/MESOS-4812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Lukas Loesche
>Assignee: haosdent
>  Labels: health-check
> Attachments: health_task.gif
>
>
> As described in https://github.com/mesosphere/marathon/issues/
> I would like to run a command health check
> {noformat}
> /bin/bash -c " {noformat}
> The health check fails because Mesos runs the command inside the double 
> quotes of a sh -c "" but doesn't escape the double quotes in the command.
> If I escape the double quotes myself, the command health check succeeds. But 
> this would mean that users need intimate knowledge of how Mesos executes 
> their commands, which can't be right.
> I was told this is not a Marathon issue but a Mesos issue, so I am opening this JIRA. 
> I don't know if this only affects the command health check.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6742) Adding support for s390x architecture

2016-12-07 Thread Ayanampudi Varsha (JIRA)
Ayanampudi Varsha created MESOS-6742:


 Summary: Adding support for s390x architecture 
 Key: MESOS-6742
 URL: https://issues.apache.org/jira/browse/MESOS-6742
 Project: Mesos
  Issue Type: Bug
Reporter: Ayanampudi Varsha


There are 2 issues:
1. The LdcacheTest.Parse test case fails on s390x machines.
2. Because of the value of the docker_registry flag in slave.cpp, amd64 images
are downloaded, so test cases fail on s390x with "Exec format error".





[jira] [Created] (MESOS-6741) Authorize v1 SET_LOGGING_LEVEL call

2016-12-07 Thread Adam B (JIRA)
Adam B created MESOS-6741:
-

 Summary: Authorize v1 SET_LOGGING_LEVEL call
 Key: MESOS-6741
 URL: https://issues.apache.org/jira/browse/MESOS-6741
 Project: Mesos
  Issue Type: Bug
  Components: agent, security
Reporter: Adam B


We need to add authz to this call to prevent unauthorized users from cranking 
the log level way up to take down an agent/master.
In the v0 API, we protected the /logging/toggle endpoint with a 
"coarse-grained" GET_ENDPOINT_WITH_PATH ACL, but that cannot be reused 
(directly) in the v1 API.
We could add an analogous coarse-grained V1_CALL_WITH_ACTION ACL, but we're
probably better off just adding a trivial SET_LOG_LEVEL Authorization::Action 
and ACL.





[jira] [Updated] (MESOS-6739) Authorize v1 GET_CONTAINERS call

2016-12-07 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6739:
--
Priority: Critical  (was: Major)

> Authorize v1 GET_CONTAINERS call
> 
>
> Key: MESOS-6739
> URL: https://issues.apache.org/jira/browse/MESOS-6739
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, security
>Reporter: Adam B
>Priority: Critical
>  Labels: security
>
> We need some kind of authorization for GET_CONTAINERS.
> a. Coarse-grained like we already did for /containers. With this you could 
> say that Alice can GET_CONTAINERS for any/all containers on the cluster, but 
> Bob cannot see any containers' info.
> b. Fine-grained authz like we have for /state and /tasks. With this you could 
> say that Alice can GET_CONTAINERS and see filtered results where user=alice, 
> but Bob can only see filtered results where user=bob. It would be nice to 
> port this to /containers as well if/when we add it.





[jira] [Updated] (MESOS-6741) Authorize v1 SET_LOGGING_LEVEL call

2016-12-07 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6741:
--
Priority: Minor  (was: Major)

> Authorize v1 SET_LOGGING_LEVEL call
> ---
>
> Key: MESOS-6741
> URL: https://issues.apache.org/jira/browse/MESOS-6741
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, security
>Reporter: Adam B
>Priority: Minor
>  Labels: security
>
> We need to add authz to this call to prevent unauthorized users from cranking 
> the log level way up to take down an agent/master.
> In the v0 API, we protected the /logging/toggle endpoint with a 
> "coarse-grained" GET_ENDPOINT_WITH_PATH ACL, but that cannot be reused 
> (directly) in the v1 API.
> We could add an analogous coarse-grained V1_CALL_WITH_ACTION ACL, but we're
> probably better off just adding a trivial SET_LOG_LEVEL Authorization::Action 
> and ACL.





[jira] [Created] (MESOS-6740) Authorize v1 GET_FLAGS call

2016-12-07 Thread Adam B (JIRA)
Adam B created MESOS-6740:
-

 Summary: Authorize v1 GET_FLAGS call
 Key: MESOS-6740
 URL: https://issues.apache.org/jira/browse/MESOS-6740
 Project: Mesos
  Issue Type: Bug
  Components: agent, security
Reporter: Adam B


We already have a VIEW_FLAGS ACL that we use for /flags and the flags part of 
/state. Let's add authz to the v1 GET_FLAGS API call (on agent and master) and 
reuse that ACL.





[jira] [Created] (MESOS-6739) Authorize v1 GET_CONTAINERS call

2016-12-07 Thread Adam B (JIRA)
Adam B created MESOS-6739:
-

 Summary: Authorize v1 GET_CONTAINERS call
 Key: MESOS-6739
 URL: https://issues.apache.org/jira/browse/MESOS-6739
 Project: Mesos
  Issue Type: Bug
  Components: agent, security
Reporter: Adam B


We need some kind of authorization for GET_CONTAINERS.
a. Coarse-grained like we already did for /containers. With this you could say 
that Alice can GET_CONTAINERS for any/all containers on the cluster, but Bob 
cannot see any containers' info.
b. Fine-grained authz like we have for /state and /tasks. With this you could 
say that Alice can GET_CONTAINERS and see filtered results where user=alice, 
but Bob can only see filtered results where user=bob. It would be nice to port 
this to /containers as well if/when we add it.


