[jira] [Created] (MESOS-6344) Allow `network/cni` isolator to take a search path for CNI plugins instead of single directory

2016-10-07 Thread Avinash Sridharan (JIRA)
Avinash Sridharan created MESOS-6344:


 Summary: Allow `network/cni` isolator to take a search path for 
CNI plugins instead of single directory
 Key: MESOS-6344
 URL: https://issues.apache.org/jira/browse/MESOS-6344
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Avinash Sridharan
Assignee: Avinash Sridharan


Currently the `network/cni` isolator expects a single directory via the 
`--network_cni_plugins_dir` flag. This is very limiting because it forces the 
operator to put all the CNI plugins in the same directory. 

With the Mesos port-mapper CNI plugin this would also imply that the operator 
would have to move this plugin from the Mesos installation directory to the 
directory specified in `--network_cni_plugins_dir`. 

To simplify the operator's experience it would make sense for the 
`--network_cni_plugins_dir` flag to take a set of directories instead of a 
single directory. The `network/cni` isolator can then search this set of 
directories to find the CNI plugin.
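
If the flag accepted a search path, the isolator could resolve a plugin by 
probing each directory in order. A minimal standalone sketch of that lookup 
(plain C++17, not the actual isolator code; the search path and plugin names 
below are illustrative):

{code}
// Minimal sketch: given a colon-separated list of directories, return the
// first one that contains the named plugin binary.
#include <filesystem>
#include <iostream>
#include <optional>
#include <sstream>
#include <string>

namespace fs = std::filesystem;

std::optional<fs::path> findPlugin(
    const std::string& searchPath,   // e.g. "/opt/cni/bin:/usr/libexec/mesos"
    const std::string& plugin)       // e.g. "mesos-cni-port-mapper"
{
  std::stringstream stream(searchPath);
  std::string dir;
  while (std::getline(stream, dir, ':')) {
    if (dir.empty()) {
      continue;
    }
    fs::path candidate = fs::path(dir) / plugin;
    std::error_code ec;
    if (fs::exists(candidate, ec) && !ec) {
      return candidate;
    }
  }
  return std::nullopt;
}

int main()
{
  // Hypothetical example values; the real flag would supply the search path.
  auto path = findPlugin("/opt/cni/bin:/usr/libexec/mesos", "bridge");
  std::cout << (path ? path->string() : "plugin not found") << std::endl;
  return 0;
}
{code}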



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6343) Documentation Error: Default Executor does not implicitly construct resources

2016-10-07 Thread Joris Van Remoortere (JIRA)
Joris Van Remoortere created MESOS-6343:
---

 Summary: Documentation Error: Default Executor does not implicitly 
construct resources
 Key: MESOS-6343
 URL: https://issues.apache.org/jira/browse/MESOS-6343
 Project: Mesos
  Issue Type: Documentation
Reporter: Joris Van Remoortere
Priority: Blocker


https://github.com/apache/mesos/blob/d16f53d5a9e15d1d9533739a8c052bc546ec3262/include/mesos/v1/mesos.proto#L544-L546

This probably got carried forward from early design discussions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6342) Not able to access TaskInfo's Data field from Tasks launched by CmdExecutor

2016-10-07 Thread Nima Vaziri (JIRA)
Nima Vaziri created MESOS-6342:
--

 Summary: Not able to access TaskInfo's Data field from Tasks 
launched by CmdExecutor
 Key: MESOS-6342
 URL: https://issues.apache.org/jira/browse/MESOS-6342
 Project: Mesos
  Issue Type: Bug
Reporter: Nima Vaziri


There's some config data that's being put in a TaskInfo's Data field on the 
Scheduler's side.

This data is of arbitrary size (on the order of hundreds of KB), so it might 
make sense to dump it into a file on the executor's side when its size is big.
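
One possible shape of that workaround is to have the executor write the raw 
bytes of {{TaskInfo.data}} into a file in the sandbox so the launched command 
can read it from disk. A minimal sketch, assuming a hypothetical file name and 
helper (this is not what the command executor does today):

{code}
// Illustrative only: persist an arbitrarily large task payload to the sandbox
// so that the launched command can read it from a file instead of a protobuf
// field it cannot reach.
#include <fstream>
#include <string>

bool writeTaskData(const std::string& sandboxDir, const std::string& data)
{
  // "task.data" is a hypothetical file name, not something Mesos defines.
  std::ofstream out(sandboxDir + "/task.data", std::ios::binary);
  if (!out) {
    return false;
  }
  out.write(data.data(), static_cast<std::streamsize>(data.size()));
  return static_cast<bool>(out);
}
{code}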



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5879) cgroups/net_cls isolator causing agent recovery issues

2016-10-07 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556798#comment-15556798
 ] 

Avinash Sridharan commented on MESOS-5879:
--

@haosdent I am assuming we can close this once we fix MESOS-6035?

> cgroups/net_cls isolator causing agent recovery issues
> --
>
> Key: MESOS-5879
> URL: https://issues.apache.org/jira/browse/MESOS-5879
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups, isolation, slave
>Reporter: Silas Snider
>Assignee: Avinash Sridharan
>  Labels: mesosphere
>
> We run with 'cgroups/net_cls' in our isolator list, and when we restart any 
> agent process in a cluster running an experimental custom isolator as well, 
> the agents are unable to recover from checkpoint, because net_cls reports 
> that unknown orphan containers have duplicate net_cls handles.
> While this is a problem that needs to be solved (probably by fixing our 
> custom isolator), it's also a problem that the net_cls isolator fails 
> recovery just for duplicate handles in cgroups that it is literally about to 
> unconditionally destroy during recovery. Can this be fixed?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6323) 'mesos-containerizer launch' should inherit agent environment variables.

2016-10-07 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-6323:
--
Target Version/s: 1.1.0

> 'mesos-containerizer launch' should inherit agent environment variables.
> 
>
> Key: MESOS-6323
> URL: https://issues.apache.org/jira/browse/MESOS-6323
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Priority: Critical
>
> If some dynamic libraries that the agent depends on are stored in a 
> non-standard location, the operator may start the agent using LD_LIBRARY_PATH. 
> When we actually fork/exec the 'mesos-containerizer launch' helper, we need to 
> make sure it inherits the agent's environment variables; otherwise, it might 
> throw linking errors. This makes sense because it's a Mesos-controlled process.
> However, when the helper actually fork/execs the user container (or executor), 
> we need to make sure to strip the agent's environment variables.
> The tricky case is the default executor and the command executor. These two 
> are controlled by Mesos as well, so we also want them to have the agent's 
> environment variables. We need to somehow distinguish this from the custom 
> executor case.
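
The distinction described above comes down to which environment is passed at 
exec time. A standalone sketch (plain POSIX C++, not the actual launcher 
code), assuming a boolean that says whether the child is Mesos-controlled:

{code}
// Illustrative sketch: a Mesos-controlled helper (e.g. 'mesos-containerizer
// launch') inherits the agent's environment so LD_LIBRARY_PATH still applies;
// a user container gets only an explicitly constructed environment.
#include <unistd.h>
#include <string>
#include <vector>

extern char** environ;

void execChild(
    const std::string& path,
    const std::vector<std::string>& userEnv,  // "KEY=value" entries
    bool mesosControlled)
{
  char* const argv[] = {const_cast<char*>(path.c_str()), nullptr};

  if (mesosControlled) {
    // Inherit the agent's environment unchanged.
    execve(path.c_str(), argv, environ);
  } else {
    // Strip the agent's environment and pass only what was set explicitly.
    std::vector<char*> envp;
    for (const std::string& kv : userEnv) {
      envp.push_back(const_cast<char*>(kv.c_str()));
    }
    envp.push_back(nullptr);
    execve(path.c_str(), argv, envp.data());
  }
  _exit(127);  // Only reached if execve failed.
}
{code}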



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6106) Validate the host ports which container wants to expose to are from the resources assigned to it

2016-10-07 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556791#comment-15556791
 ] 

Avinash Sridharan commented on MESOS-6106:
--

[~qianzhang], I had a discussion with [~jieyu], and he wants to land the 
port-mapper CNI plugin in 1.1.0, which is probably a week away. I wanted to 
check if we can get this done in that time frame. I am marking the target 
version as 1.1.0 for the time being so that it shows up on the dashboard.

> Validate the host ports which container wants to expose to are from the 
> resources assigned to it
> 
>
> Key: MESOS-6106
> URL: https://issues.apache.org/jira/browse/MESOS-6106
> Project: Mesos
>  Issue Type: Task
>  Components: isolation
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> In the CNI isolator, we need to validate that the host ports which the 
> container wants to expose ({{NetworkInfo.PortMapping.host_port}}) are from the 
> resources assigned to it (i.e., from the resource offer used by the framework 
> to launch the container), so that we can ensure the container will not be 
> exposed on an arbitrary host port.
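
A minimal sketch of that validation, assuming the offered port resources have 
already been flattened into a list of inclusive ranges (types and function 
names here are illustrative, not the isolator's actual code):

{code}
// Illustrative only: verify that every requested host port falls inside one
// of the port ranges offered to the container.
#include <cstdint>
#include <utility>
#include <vector>

using PortRange = std::pair<uint32_t, uint32_t>;  // inclusive [begin, end]

bool hostPortsAreOffered(
    const std::vector<uint32_t>& requestedHostPorts,
    const std::vector<PortRange>& offeredRanges)
{
  for (uint32_t port : requestedHostPorts) {
    bool covered = false;
    for (const PortRange& range : offeredRanges) {
      if (port >= range.first && port <= range.second) {
        covered = true;
        break;
      }
    }
    if (!covered) {
      return false;  // Reject the launch: port not in the offer.
    }
  }
  return true;
}
{code}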



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6106) Validate the host ports which container wants to expose to are from the resources assigned to it

2016-10-07 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan updated MESOS-6106:
-
Target Version/s: 1.1.0

> Validate the host ports which container wants to expose to are from the 
> resources assigned to it
> 
>
> Key: MESOS-6106
> URL: https://issues.apache.org/jira/browse/MESOS-6106
> Project: Mesos
>  Issue Type: Task
>  Components: isolation
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> In the CNI isolator, we need to validate that the host ports which the 
> container wants to expose ({{NetworkInfo.PortMapping.host_port}}) are from the 
> resources assigned to it (i.e., from the resource offer used by the framework 
> to launch the container), so that we can ensure the container will not be 
> exposed on an arbitrary host port.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6017) Introduce `PortMapping` protobuf.

2016-10-07 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan updated MESOS-6017:
-
Target Version/s: 1.1.0

> Introduce `PortMapping` protobuf.
> -
>
> Key: MESOS-6017
> URL: https://issues.apache.org/jira/browse/MESOS-6017
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
> Environment: Linux
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>  Labels: mesosphere
> Fix For: 1.1.0
>
>
> Currently we have a `PortMapping` message defined for `DockerInfo`. This can 
> be used only by the `DockerContainerizer`. We need to introduce a new 
> Protobuf message in `NetworkInfo` which will allow frameworks to specify port 
> mapping when using CNI with the `MesosContainerizer`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6022) unit-test for the port mapper plugin

2016-10-07 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan updated MESOS-6022:
-
Target Version/s: 1.1.0

> unit-test for the port mapper plugin
> 
>
> Key: MESOS-6022
> URL: https://issues.apache.org/jira/browse/MESOS-6022
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
> Environment: Linux
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>  Labels: mesosphere
>
> Write unit-tests for the port mapper plugin.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6023) Create a binary for the port-mapper plugin

2016-10-07 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan updated MESOS-6023:
-
Target Version/s: 1.1.0
   Fix Version/s: 1.1.0

> Create a binary for the port-mapper plugin
> --
>
> Key: MESOS-6023
> URL: https://issues.apache.org/jira/browse/MESOS-6023
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
> Environment: Linux
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
> Fix For: 1.1.0
>
>
> The CNI port mapper plugin needs to be a separate binary that will be invoked 
> by the `network/cni` isolator as a CNI plugin.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6040) Add a CMake build for `mesos-port-mapper`

2016-10-07 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan updated MESOS-6040:
-
Target Version/s: 1.1.0

> Add a CMake build for `mesos-port-mapper`
> -
>
> Key: MESOS-6040
> URL: https://issues.apache.org/jira/browse/MESOS-6040
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>  Labels: mesosphere
>
> Once the port-mapper binary compiles with GNU make, we need to modify the 
> CMake build to build the port-mapper binary as well. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6282) CNI isolator should print plugin's stderr

2016-10-07 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan updated MESOS-6282:
-
Target Version/s: 1.1.0

> CNI isolator should print plugin's stderr
> -
>
> Key: MESOS-6282
> URL: https://issues.apache.org/jira/browse/MESOS-6282
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, isolation, network
>Reporter: Dan Osborne
>Assignee: Avinash Sridharan
>
> It's quite difficult for both operators and CNI plugin developers to diagnose 
> CNI plugin errors in production or in test when the only error information 
> available is the stdout error string returned by the plugin (assuming it 
> even managed to print correctly formatted text to stdout).
> Many CNI plugins print logging information to stderr, [as per the CNI 
> spec|https://github.com/containernetworking/cni/blob/master/SPEC.md#result]:
> bq. In addition, stderr can be used for unstructured output such as logs.
> Therefore, I propose that the Mesos CNI isolator capture the CNI plugin's 
> stderr output and log it to the agent logs, for easier diagnosis.
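
A standalone sketch of the proposal, assuming the isolator forks the plugin 
itself (plain POSIX calls rather than the subprocess utilities Mesos actually 
uses): redirect the child's stderr into a pipe and forward whatever it writes 
to the agent log.

{code}
// Illustrative only: run a CNI plugin and forward its stderr to our own log.
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <string>

int runPluginAndLogStderr(const std::string& pluginPath)
{
  int fds[2];
  if (pipe(fds) != 0) {
    return -1;
  }

  pid_t pid = fork();
  if (pid == 0) {
    // Child: send stderr into the pipe, then exec the plugin.
    close(fds[0]);
    dup2(fds[1], STDERR_FILENO);
    close(fds[1]);
    execl(pluginPath.c_str(), pluginPath.c_str(), (char*) nullptr);
    _exit(127);
  }

  // Parent: read the plugin's stderr and log each chunk.
  close(fds[1]);
  char buffer[4096];
  ssize_t n;
  while ((n = read(fds[0], buffer, sizeof(buffer))) > 0) {
    fprintf(stderr, "CNI plugin stderr: %.*s", (int) n, buffer);
  }
  close(fds[0]);

  int status = 0;
  waitpid(pid, &status, 0);
  return status;
}
{code}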



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6337) Nested containers getting killed before network isolation can be applied to them.

2016-10-07 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556678#comment-15556678
 ] 

Avinash Sridharan commented on MESOS-6337:
--

I looked into this issue, and it turns out it's a duplicate of 
https://issues.apache.org/jira/browse/MESOS-6323

Looking at the stderr of the failed nested containers, I saw the following 
error message:
mesos-containerizer: error while loading shared libraries: libssl.so.1.0.0: 
cannot open shared object file: No such file or directory

So the problem is the containers not inheriting the right environment variables.

> Nested containers getting killed before network isolation can be applied to 
> them.
> -
>
> Key: MESOS-6337
> URL: https://issues.apache.org/jira/browse/MESOS-6337
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Linux
>Reporter: Avinash Sridharan
>Assignee: Gilbert Song
>  Labels: mesosphere
>
> Seeing this odd behavior in one of our clusters:
> ```
> http.cpp:1948] Failed to launch nested container 
> cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: 
> Collect failed: Failed to seed container 
> cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: 
> Collect failed: Failed to setup hostname and network files: Failed to enter 
> the mount namespace of pid 21591: Pid 21591 does not exist
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.894485 
> 31531 containerizer.cpp:1931] Destroying container 
> cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e in 
> ISOLATING state
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.894439 
> 31531 containerizer.cpp:2300] Container 
> cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e has 
> exited
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.854456 
> 31534 systemd.cpp:96] Assigned child process '21591' to 
> 'mesos_executors.slice'
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.831861 
> 21580 process.cpp:882] Failed SSL connections will be downgraded to a non-SSL 
> socket
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set 
> LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate verification
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831526 
> 21580 openssl.cpp:432] Will only verify peer certificate if presented!
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set 
> LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831521 
> 21580 openssl.cpp:426] Will not verify peer certificate!
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831511 
> 21580 openssl.cpp:421] CA directory path unspecified! NOTE: Set CA directory 
> path with LIBPROCESS_SSL_CA_DIR=
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.831405 
> 21580 openssl.cpp:399] Failed SSL connections will be downgraded to a non-SSL 
> socket
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: WARNING: Logging before 
> InitGoogleLogging() is written to STDERR
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.828413 
> 21581 process.cpp:882] Failed SSL connections will be downgraded to a non-SSL 
> socket
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set 
> LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate verification
> ```
> The above log is in reverse chronological order, so please read it bottom up.
> The relevant log is:
> ```
> http.cpp:1948] Failed to launch nested container 
> cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: 
> Collect failed: Failed to seed container 
> cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: 
> Collect failed: Failed to setup hostname and network files: Failed to enter 
> the mount namespace of pid 21591: Pid 21591 does not exist
> ```
> It looks like the nested container failed to launch because the `isolate` call 
> to the `network/cni` isolator failed. It seems that when the isolator received 
> the `isolate` call, the PID for the nested container had already exited, so it 
> couldn't enter its mount namespace to set up the network files. 
> The odd thing here is that the nested container would have been frozen, and 
> hence was not running, so I am not sure what killed the nested container. My 
> suspicion falls on systemd, since I also see this log message:
> ```
> Oct 07 18:02:31 ip-10-10-0-207 mesos-agent[31520]: I1007 18:02:31.473656 
> 31532 systemd.cpp:96] Assigned child process '1596' to 'mesos_executors.slice'
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6337) Nested containers getting killed before network isolation can be applied to them.

2016-10-07 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6337:
--
Fix Version/s: (was: 1.1.0)

> Nested containers getting killed before network isolation can be applied to 
> them.
> -
>
> Key: MESOS-6337
> URL: https://issues.apache.org/jira/browse/MESOS-6337
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Linux
>Reporter: Avinash Sridharan
>Assignee: Gilbert Song
>  Labels: mesosphere
>
> Seeing this odd behavior in one of our clusters:
> ```
> http.cpp:1948] Failed to launch nested container 
> cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: 
> Collect failed: Failed to seed container 
> cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: 
> Collect failed: Failed to setup hostname and network files: Failed to enter 
> the mount namespace of pid 21591: Pid 21591 does not exist
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.894485 
> 31531 containerizer.cpp:1931] Destroying container 
> cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e in 
> ISOLATING state
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.894439 
> 31531 containerizer.cpp:2300] Container 
> cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e has 
> exited
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.854456 
> 31534 systemd.cpp:96] Assigned child process '21591' to 
> 'mesos_executors.slice'
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.831861 
> 21580 process.cpp:882] Failed SSL connections will be downgraded to a non-SSL 
> socket
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set 
> LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate verification
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831526 
> 21580 openssl.cpp:432] Will only verify peer certificate if presented!
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set 
> LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831521 
> 21580 openssl.cpp:426] Will not verify peer certificate!
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831511 
> 21580 openssl.cpp:421] CA directory path unspecified! NOTE: Set CA directory 
> path with LIBPROCESS_SSL_CA_DIR=
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.831405 
> 21580 openssl.cpp:399] Failed SSL connections will be downgraded to a non-SSL 
> socket
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: WARNING: Logging before 
> InitGoogleLogging() is written to STDERR
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.828413 
> 21581 process.cpp:882] Failed SSL connections will be downgraded to a non-SSL 
> socket
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set 
> LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate verification
> ```
> The above log is in reverse chronological order, so please read it bottom up.
> The relevant log is:
> ```
> http.cpp:1948] Failed to launch nested container 
> cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: 
> Collect failed: Failed to seed container 
> cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: 
> Collect failed: Failed to setup hostname and network files: Failed to enter 
> the mount namespace of pid 21591: Pid 21591 does not exist
> ```
> It looks like the nested container failed to launch because the `isolate` call 
> to the `network/cni` isolator failed. It seems that when the isolator received 
> the `isolate` call, the PID for the nested container had already exited, so it 
> couldn't enter its mount namespace to set up the network files. 
> The odd thing here is that the nested container would have been frozen, and 
> hence was not running, so I am not sure what killed the nested container. My 
> suspicion falls on systemd, since I also see this log message:
> ```
> Oct 07 18:02:31 ip-10-10-0-207 mesos-agent[31520]: I1007 18:02:31.473656 
> 31532 systemd.cpp:96] Assigned child process '1596' to 'mesos_executors.slice'
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6118) Agent would crash with docker container tasks due to host mount table read.

2016-10-07 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6118:
--
Shepherd: Jie Yu

> Agent would crash with docker container tasks due to host mount table read.
> ---
>
> Key: MESOS-6118
> URL: https://issues.apache.org/jira/browse/MESOS-6118
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 1.0.1
> Environment: Build: 2016-08-26 23:06:27 by centos
> Version: 1.0.1
> Git tag: 1.0.1
> Git SHA: 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3
> systemd version `219` detected
> Inializing systemd state
> Created systemd slice: `/run/systemd/system/mesos_executors.slice`
> Started systemd slice `mesos_executors.slice`
> Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
>  Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> Linux ip-10-254-192-40 3.10.0-327.28.3.el7.x86_64 #1 SMP Thu Aug 18 19:05:49 
> UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Jamie Briant
>Assignee: Kevin Klues
>Priority: Blocker
>  Labels: linux, slave
> Attachments: crashlogfull.log, cycle2.log, cycle3.log, cycle5.log, 
> cycle6.log, slave-crash.log
>
>
> I have a framework which schedules thousands of short-running tasks (a few 
> seconds to a few minutes each) over a period of several minutes. In 1.0.1, the 
> slave process will crash every few minutes (with systemd restarting it).
> Crash is:
> Sep 01 20:52:23 ip-10-254-192-99 mesos-slave: F0901 20:52:23.905678  1232 
> fs.cpp:140] Check failed: !visitedParents.contains(parentId)
> Sep 01 20:52:23 ip-10-254-192-99 mesos-slave: *** Check failure stack trace: 
> ***
> Version 1.0.0 works without this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6308) CHECK failure in DRF sorter.

2016-10-07 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556594#comment-15556594
 ] 

Guangya Liu commented on MESOS-6308:


Thanks [~bbannier]. I reproduced this issue again after running for almost 
1 hour and found that it failed as follows when adding metrics:

{code}
F1007 18:22:39.125012 255385600 sorter.cpp:458] Check failed: contains(name)
*** Check failure stack trace: ***
@0x108b7afda  google::LogMessage::Fail()
@0x108b79f67  google::LogMessage::SendToLog()
@0x108b7ac8a  google::LogMessage::Flush()
@0x108b81af8  google::LogMessageFatal::~LogMessageFatal()
@0x108b7b415  google::LogMessageFatal::~LogMessageFatal()
@0x106bcd4d5  
mesos::internal::master::allocator::DRFSorter::calculateShare()
@0x106bc710e  
mesos::internal::master::allocator::Metrics::add()::$_0::operator()()
@0x106bca6e2  
_ZZN7process8internal8DispatchIdEclIRKZN5mesos8internal6master9allocator7Metrics3addERKNSt3__112basic_stringIcNS9_11char_traitsIcEENS9_9allocatorIcE3$_0EENS_6FutureIdEERKNS_4UPIDEOT_ENKUlPNS_11ProcessBaseEE_clEST_
@0x106bca6a0  
_ZNSt3__128__invoke_void_return_wrapperIvE6__callIJRZN7process8internal8DispatchIdEclIRKZN5mesos8internal6master9allocator7Metrics3addERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcE3$_0EENS3_6FutureIdEERKNS3_4UPIDEOT_EUlPNS3_11ProcessBaseEE_SW_EEEvDpOT_
@0x106bca34c  
_ZNSt3__110__function6__funcIZN7process8internal8DispatchIdEclIRKZN5mesos8internal6master9allocator7Metrics3addERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcE3$_0EENS2_6FutureIdEERKNS2_4UPIDEOT_EUlPNS2_11ProcessBaseEE_NSF_ISW_EEFvSV_EEclEOSV_
@0x108a598df  std::__1::function<>::operator()()
@0x108a2a30f  process::ProcessBase::visit()
@0x108a8df9e  process::DispatchEvent::visit()
@0x100c65c51  process::ProcessBase::serve()
@0x108a26fe1  process::ProcessManager::resume()
@0x108a32ad6  
process::ProcessManager::init_threads()::$_1::operator()()
@0x108a32779  
_ZNSt3__114__thread_proxyINS_5tupleIJZN7process14ProcessManager12init_threadsEvE3$_1EPvS6_
@ 0x7fff957a405a  _pthread_body
@ 0x7fff957a3fd7  _pthread_start
@ 0x7fff957a13ed  thread_start
E1007 18:23:06.083991 317579264 process.cpp:2154] Failed to shutdown socket 
with fd 15: Socket is not connected
Abort trap: 6
{code}

I will check further whether there are cases where we add metrics for a 
non-existent client. [~bbannier], please share your comments if any. Thanks.
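
To illustrate the failure mode under discussion: the CHECK fires when a share 
is computed for a client name the sorter no longer contains, e.g. when a 
metrics gauge is evaluated after the client was removed. A generic sketch of a 
defensive lookup (not the DRF sorter's actual code, and not necessarily the 
right fix):

{code}
// Generic illustration of the race: a metrics gauge may ask for a client's
// share after the client has been removed from the sorter. Returning a
// default instead of CHECK-failing avoids crashing in that window.
#include <map>
#include <string>

class Sorter
{
public:
  void add(const std::string& name) { allocations[name] = 0.0; }
  void remove(const std::string& name) { allocations.erase(name); }

  // Hypothetical defensive variant of calculateShare().
  double shareOrZero(const std::string& name) const
  {
    auto it = allocations.find(name);
    return it == allocations.end() ? 0.0 : it->second;
  }

private:
  std::map<std::string, double> allocations;
};
{code}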


> CHECK failure in DRF sorter.
> 
>
> Key: MESOS-6308
> URL: https://issues.apache.org/jira/browse/MESOS-6308
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Assignee: Guangya Liu
>
> Saw this CHECK failed in our internal CI:
> https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L450
> {noformat}
> [03:08:28] :   [Step 10/10] [ RUN  ] PartitionTest.DisconnectedFramework
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.200443   577 cluster.cpp:158] 
> Creating default 'local' authorizer
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.206408   577 leveldb.cpp:174] 
> Opened db in 5.827159ms
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208127   577 leveldb.cpp:181] 
> Compacted db in 1.697508ms
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208150   577 leveldb.cpp:196] 
> Created db iterator in 5756ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208160   577 leveldb.cpp:202] 
> Seeked to beginning of db in 1483ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208168   577 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 1101ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208184   577 replica.cpp:776] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208452   591 recover.cpp:451] 
> Starting replica recovery
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208664   596 recover.cpp:477] 
> Replica is in EMPTY status
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209079   591 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> __req_res__(3666)@172.30.2.234:37300
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209203   593 recover.cpp:197] 
> Received a recover response from a replica in EMPTY status
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209394   598 recover.cpp:568] 
> Updating replica status to STARTING
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209473   598 master.cpp:380] 
> Master dd11d4ad-2087-4324-99ef-873e83ff09a1 (ip-172-30-2-234.mesosphere.io) 
> started on 172.30.2.234:37300
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209489   598 master.cpp:382] Flags 
> at startup: --acls="" 

[jira] [Updated] (MESOS-6341) Improve environment variable setting for executors, tasks and nested containers.

2016-10-07 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-6341:
--
Component/s: slave
 containerization

> Improve environment variable setting for executors, tasks and nested 
> containers.
> 
>
> Key: MESOS-6341
> URL: https://issues.apache.org/jira/browse/MESOS-6341
> Project: Mesos
>  Issue Type: Epic
>  Components: containerization, slave
>Reporter: Jie Yu
>
> This is an umbrella ticket to track all the environment variable related 
> tickets in Mesos that need to be solved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3740) LIBPROCESS_IP not passed to Docker containers

2016-10-07 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-3740:
--
Story Points:   (was: 3)

> LIBPROCESS_IP not passed to Docker containers
> -
>
> Key: MESOS-3740
> URL: https://issues.apache.org/jira/browse/MESOS-3740
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
> Environment: Mesos 0.24.1
>Reporter: Cody Maloney
>  Labels: mesosphere
>
> Docker containers aren't currently passed all the same environment variables 
> that Mesos Containerizer tasks are. See: 
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L254
>  for all the environment variables explicitly set for mesos containers.
> While some of them don't necessarily make sense for docker containers, when 
> the docker container has a libprocess process inside it (a Mesos framework 
> scheduler) and is using {{--net=host}}, the task needs to have LIBPROCESS_IP 
> set; otherwise the same sort of problems that happen because of MESOS-3553 can 
> happen (libprocess will try to guess the machine's IP address, with likely bad 
> results in a number of operating environments).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6341) Improve environment variable setting for executors, tasks and nested containers.

2016-10-07 Thread Jie Yu (JIRA)
Jie Yu created MESOS-6341:
-

 Summary: Improve environment variable setting for executors, tasks 
and nested containers.
 Key: MESOS-6341
 URL: https://issues.apache.org/jira/browse/MESOS-6341
 Project: Mesos
  Issue Type: Epic
Reporter: Jie Yu


This is an umbrella ticket to track all the environment variable related 
tickets in Mesos that need to be solved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6340) Set HOME for Mesos tasks

2016-10-07 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556516#comment-15556516
 ] 

Zameer Manji commented on MESOS-6340:
-

Thermos (Aurora's executor) works around this issue by setting {{$HOME}} to the 
cwd ({{$WORK_DIR}}) and/or using {{$MESOS_SANDBOX}} when it is set.

I think [~joshua.cohen] can confirm or deny this.

Personally, if {{$HOME}} could default to those values, that would be 
fantastic. Executors can do their own customization if needed, but setting 
something would be better than nothing.

> Set HOME for Mesos tasks
> 
>
> Key: MESOS-6340
> URL: https://issues.apache.org/jira/browse/MESOS-6340
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, slave
>Reporter: Cody Maloney
>Assignee: Jie Yu
>
> Quite a few programs assume {{$HOME}} points to a user-editable data file 
> directory.
> One example is Python, which tries to look up $HOME to find user-installed 
> packages, and if that fails it tries to look up the user in the passwd 
> database, which often goes badly (the container is running under the `nobody` 
> user):
> {code}
> if i == 1:
>     if 'HOME' not in os.environ:
>         import pwd
>         userhome = pwd.getpwuid(os.getuid()).pw_dir
>     else:
>         userhome = os.environ['HOME']
> {code}
> Just setting HOME by default to WORK_DIR would enable more software to work 
> correctly out of the box. Software which needs to specialize / change it (or 
> schedulers with specific preferences) should still be able to set it 
> arbitrarily, and anything a scheduler explicitly sets should overwrite the 
> default value of $WORK_DIR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6340) Set HOME for Mesos tasks

2016-10-07 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-6340:
---

 Summary: Set HOME for Mesos tasks
 Key: MESOS-6340
 URL: https://issues.apache.org/jira/browse/MESOS-6340
 Project: Mesos
  Issue Type: Bug
  Components: containerization, slave
Reporter: Cody Maloney
Assignee: Jie Yu


Quite a few programs assume {{$HOME}} points to a user-editable data file 
directory.

One example is Python, which tries to look up $HOME to find user-installed 
packages, and if that fails it tries to look up the user in the passwd database, 
which often goes badly (the container is running under the `nobody` user):

{code}
if i == 1:
    if 'HOME' not in os.environ:
        import pwd
        userhome = pwd.getpwuid(os.getuid()).pw_dir
    else:
        userhome = os.environ['HOME']
{code}

Just setting HOME by default to WORK_DIR would enable more software to work 
correctly out of the box. Software which needs to specialize / change it (or 
schedulers with specific preferences) should still be able to set it 
arbitrarily, and anything a scheduler explicitly sets should overwrite the 
default value of $WORK_DIR.
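
A minimal sketch of the suggested default, assuming the launcher knows the 
task's work directory and should only set {{HOME}} when nothing set it 
explicitly (the third argument of POSIX {{setenv}} does exactly that):

{code}
// Illustrative only: default HOME to the task's work directory, but never
// overwrite a value the scheduler (or operator) set explicitly.
#include <cstdlib>
#include <string>

void defaultHome(const std::string& workDir)
{
  // overwrite = 0 keeps any existing HOME untouched.
  ::setenv("HOME", workDir.c_str(), 0);
}
{code}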



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5578) Support static address allocation in CNI

2016-10-07 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan updated MESOS-5578:
-
Affects Version/s: (was: 1.0.0)

> Support static address allocation in CNI
> 
>
> Key: MESOS-5578
> URL: https://issues.apache.org/jira/browse/MESOS-5578
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
> Environment: Linux
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>  Labels: mesosphere
>
> Currently a framework can't specify a static IP address for the container 
> when using the network/cni isolator.
> The `ipaddress` field in the `NetworkInfo` protobuf was designed for this 
> specific purpose, but since the CNI spec does not specify a means to allocate 
> an IP address to the container, the `network/cni` isolator cannot honor this 
> field even when it is filled in by the framework.
> This ticket is a placeholder to track this limitation. As and when the CNI 
> spec allows us to specify a static IP address for the container, we can 
> resolve this ticket. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6339) Support docker registry that requires basic auth.

2016-10-07 Thread Jie Yu (JIRA)
Jie Yu created MESOS-6339:
-

 Summary: Support docker registry that requires basic auth.
 Key: MESOS-6339
 URL: https://issues.apache.org/jira/browse/MESOS-6339
 Project: Mesos
  Issue Type: Improvement
Reporter: Jie Yu


Currently, we assume Bearer auth (in the Mesos containerizer) because that is 
what Docker Hub uses. We also need to support basic auth for some private 
registries that people deploy.
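
For reference, the two schemes differ only in the Authorization header the 
registry client sends. A minimal sketch (illustrative, with a small local 
base64 helper; this is not the Mesos registry client code):

{code}
// Illustrative only: build the Authorization header for each scheme.
#include <cstdint>
#include <string>

// Minimal RFC 4648 base64 encoder (illustrative, unoptimized).
std::string base64Encode(const std::string& input)
{
  static const char* table =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  std::string out;
  size_t i = 0;
  while (i + 2 < input.size()) {
    uint32_t n = (static_cast<unsigned char>(input[i]) << 16) |
                 (static_cast<unsigned char>(input[i + 1]) << 8) |
                  static_cast<unsigned char>(input[i + 2]);
    out += table[(n >> 18) & 63];
    out += table[(n >> 12) & 63];
    out += table[(n >> 6) & 63];
    out += table[n & 63];
    i += 3;
  }
  if (i + 1 == input.size()) {
    uint32_t n = static_cast<unsigned char>(input[i]) << 16;
    out += table[(n >> 18) & 63];
    out += table[(n >> 12) & 63];
    out += "==";
  } else if (i + 2 == input.size()) {
    uint32_t n = (static_cast<unsigned char>(input[i]) << 16) |
                 (static_cast<unsigned char>(input[i + 1]) << 8);
    out += table[(n >> 18) & 63];
    out += table[(n >> 12) & 63];
    out += table[(n >> 6) & 63];
    out += '=';
  }
  return out;
}

std::string bearerAuthHeader(const std::string& token)
{
  // What a Docker Hub style token server hands back.
  return "Authorization: Bearer " + token;
}

std::string basicAuthHeader(const std::string& user, const std::string& password)
{
  // RFC 7617: base64("user:password").
  return "Authorization: Basic " + base64Encode(user + ":" + password);
}
{code}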



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6239) Fix warnings and errors produced by new hardened CXXFLAGS

2016-10-07 Thread Aaron Wood (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Wood updated MESOS-6239:
--
Description: 
Most of the new warnings/errors come from libprocess/stout as there were never 
any CXXFLAGS propagated to them.

https://reviews.apache.org/r/52647/

  was:Most of the new warnings/errors come from libprocess/stout as there were 
never any CXXFLAGS propagated to them.


> Fix warnings and errors produced by new hardened CXXFLAGS
> -
>
> Key: MESOS-6239
> URL: https://issues.apache.org/jira/browse/MESOS-6239
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Aaron Wood
>Assignee: Aaron Wood
>Priority: Minor
>  Labels: c++, clang, gcc, libprocess, security, stout
>
> Most of the new warnings/errors come from libprocess/stout as there were 
> never any CXXFLAGS propagated to them.
> https://reviews.apache.org/r/52647/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6229) Default to using hardened compilation flags

2016-10-07 Thread Aaron Wood (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Wood updated MESOS-6229:
--
Description: 
Provide a default set of hardened compilation flags to help protect against 
overflows and other attacks. Apply to libprocess and stout as well. Current set 
of flags that were discussed on slack to implement:

-Wformat-security
-Wstack-protector
-fstack-protector-strong (-fstack-protector-all might be overkill, it could be 
more effective to use this. Requires gcc >= 4.9 which should be reasonable)
-pie
-fPIE 
-fPIC
-D_FORTIFY_SOURCE=2
-Wl,-z,relro,-z,now (currently not a part of the patch)
-fno-omit-frame-pointer

https://reviews.apache.org/r/52645/


  was:
Provide a default set of hardened compilation flags to help protect against 
overflows and other attacks. Apply to libprocess and stout as well. Current set 
of flags that were discussed on slack to implement:

-Wformat-security
-Wstack-protector
-fstack-protector-strong (-fstack-protector-all might be overkill, it could be 
more effective to use this. Requires gcc >= 4.9)
-pie
-fPIE 
-fPIC
-D_FORTIFY_SOURCE=2
-Wl,-z,relro,-z,now (currently not a part of the patch)
-fno-omit-frame-pointer

https://reviews.apache.org/r/52645/



> Default to using hardened compilation flags
> ---
>
> Key: MESOS-6229
> URL: https://issues.apache.org/jira/browse/MESOS-6229
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Aaron Wood
>Assignee: Aaron Wood
>Priority: Minor
>  Labels: c++, clang, gcc, security
>
> Provide a default set of hardened compilation flags to help protect against 
> overflows and other attacks. Apply to libprocess and stout as well. Current 
> set of flags that were discussed on slack to implement:
> -Wformat-security
> -Wstack-protector
> -fstack-protector-strong (-fstack-protector-all might be overkill, it could 
> be more effective to use this. Requires gcc >= 4.9 which should be reasonable)
> -pie
> -fPIE 
> -fPIC
> -D_FORTIFY_SOURCE=2
> -Wl,-z,relro,-z,now (currently not a part of the patch)
> -fno-omit-frame-pointer
> https://reviews.apache.org/r/52645/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6229) Default to using hardened compilation flags

2016-10-07 Thread Aaron Wood (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Wood updated MESOS-6229:
--
Description: 
Provide a default set of hardened compilation flags to help protect against 
overflows and other attacks. Apply to libprocess and stout as well. Current set 
of flags that were discussed on slack to implement:

-Wformat-security
-Wstack-protector
-fstack-protector-strong (-fstack-protector-all might be overkill, it could be 
more effective to use this. Requires gcc >= 4.9)
-pie
-fPIE 
-fPIC
-D_FORTIFY_SOURCE=2
-Wl,-z,relro,-z,now (currently not a part of the patch)
-fno-omit-frame-pointer

https://reviews.apache.org/r/52645/


  was:
Provide a default set of hardened compilation flags to help protect against 
overflows and other attacks. Apply to libprocess and stout as well. Current set 
of flags that were discussed on slack to implement:

-Wformat-security
-Wstack-protector
-fstack-protector-strong (-fstack-protector-all might be overkill, it could be 
more effective to use this. Requires gcc >= 4.9)
-pie
-fPIE 
-D_FORTIFY_SOURCE=2
-Wl,-z,relro,-z,now (currently not a part of the patch)
-fno-omit-frame-pointer

https://reviews.apache.org/r/52645/



> Default to using hardened compilation flags
> ---
>
> Key: MESOS-6229
> URL: https://issues.apache.org/jira/browse/MESOS-6229
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Aaron Wood
>Assignee: Aaron Wood
>Priority: Minor
>  Labels: c++, clang, gcc, security
>
> Provide a default set of hardened compilation flags to help protect against 
> overflows and other attacks. Apply to libprocess and stout as well. Current 
> set of flags that were discussed on slack to implement:
> -Wformat-security
> -Wstack-protector
> -fstack-protector-strong (-fstack-protector-all might be overkill, it could 
> be more effective to use this. Requires gcc >= 4.9)
> -pie
> -fPIE 
> -fPIC
> -D_FORTIFY_SOURCE=2
> -Wl,-z,relro,-z,now (currently not a part of the patch)
> -fno-omit-frame-pointer
> https://reviews.apache.org/r/52645/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6229) Default to using hardened compilation flags

2016-10-07 Thread Aaron Wood (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Wood updated MESOS-6229:
--
Description: 
Provide a default set of hardened compilation flags to help protect against 
overflows and other attacks. Apply to libprocess and stout as well. Current set 
of flags that were discussed on slack to implement:

-Wformat-security
-Wstack-protector
-fstack-protector-strong (-fstack-protector-all might be overkill, it could be 
more effective to use this. Requires gcc >= 4.9)
-pie
-fPIE 
-D_FORTIFY_SOURCE=2
-O2 (possibly -O3 for greater optimizations, up for discussion)
-Wl,-z,relro,-z,now (currently not a part of the patch)
-fno-omit-frame-pointer

https://reviews.apache.org/r/52645/


  was:
Provide a default set of hardened compilation flags to help protect against 
overflows and other attacks. Apply to libprocess and stout as well. Current set 
of flags that were discussed on slack to implement:

-Wformat-security
-Wstack-protector
-fstack-protector-all
-pie
-fPIE 
-D_FORTIFY_SOURCE=2
-O2 (possibly -O3 for greater optimizations, up for discussion)
-Wl,-z,relro,-z,now
-fno-omit-frame-pointer
-fstack-protector-strong (-fstack-protector-all might be overkill, it could be 
more effective to use this. Requires gcc >= 4.9)



> Default to using hardened compilation flags
> ---
>
> Key: MESOS-6229
> URL: https://issues.apache.org/jira/browse/MESOS-6229
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Aaron Wood
>Assignee: Aaron Wood
>Priority: Minor
>  Labels: c++, clang, gcc, security
>
> Provide a default set of hardened compilation flags to help protect against 
> overflows and other attacks. Apply to libprocess and stout as well. Current 
> set of flags that were discussed on slack to implement:
> -Wformat-security
> -Wstack-protector
> -fstack-protector-strong (-fstack-protector-all might be overkill, it could 
> be more effective to use this. Requires gcc >= 4.9)
> -pie
> -fPIE 
> -D_FORTIFY_SOURCE=2
> -O2 (possibly -O3 for greater optimizations, up for discussion)
> -Wl,-z,relro,-z,now (currently not a part of the patch)
> -fno-omit-frame-pointer
> https://reviews.apache.org/r/52645/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6283) Fix the Web UI allowing access to the task sandbox for nested containers.

2016-10-07 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6283:
--
Target Version/s: 1.1.0
Priority: Blocker  (was: Major)
   Fix Version/s: (was: 1.1.0)

> Fix the Web UI allowing access to the task sandbox for nested containers.
> -
>
> Key: MESOS-6283
> URL: https://issues.apache.org/jira/browse/MESOS-6283
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Reporter: Anand Mazumdar
>Assignee: haosdent
>Priority: Blocker
>  Labels: mesosphere
> Attachments: sandbox.gif
>
>
> Currently, the sandbox button for a child task is broken on the WebUI. It 
> does nothing and dies with an error that the executor for this task cannot be 
> found. We need to fix the WebUI to follow the symlink "tasks/taskId" and 
> display the task sandbox to the users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6338) Support --revocable_cpu_low_priority flag for docker containerizer

2016-10-07 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556038#comment-15556038
 ] 

Jie Yu commented on MESOS-6338:
---

Sounds good. Keep in mind that the docker containerizer will receive less 
support than the mesos containerizer in the future, and new features (e.g., 
pods, GPUs) will typically go to the mesos containerizer first.

> Support --revocable_cpu_low_priority flag for docker containerizer
> --
>
> Key: MESOS-6338
> URL: https://issues.apache.org/jira/browse/MESOS-6338
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Kunal Thakar
>
> The mesos containerizer supports setting lower shares for revocable tasks by 
> passing --revocable_cpu_low_priority to the mesos agent. This flag is only 
> supported for mesos containerizer, but I don't see a reason why the behavior 
> can't be replicated for the docker containerizer. 
> When the flag is set, the CPU shares assigned to revocable tasks are lower 
> than those of normal tasks 
> (https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/isolators/cgroups/subsystems/cpu.cpp#L83).
>  This does not happen in the docker containerizer 
> (https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1517),
>  but it can be easily replicated there. 
> I can send a patch if this is acceptable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6317) Race in master/allocator when updating oversubscribed resources of an agent.

2016-10-07 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-6317:
---
Summary: Race in master/allocator when updating oversubscribed resources of 
an agent.  (was: Race in master update slave.)

> Race in master/allocator when updating oversubscribed resources of an agent.
> 
>
> Key: MESOS-6317
> URL: https://issues.apache.org/jira/browse/MESOS-6317
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
> Fix For: 1.1.0
>
>
> Currently, when {{updateSlave}} is called in the master, it will first rescind 
> offers and then call updateSlave in the allocator, but there is a race here: a 
> batch allocation might be inserted between the two. In that case, the order 
> will be rescind offer -> batch allocation -> update slave. This order can 
> cause issues when the oversubscribed resources are decreased.
> Suppose the oversubscribed resources are decreased from 2 to 1. After the 
> rescind offer finishes, the batch allocation will allocate the old 2 
> oversubscribed resources again, and then update slave will update the total 
> oversubscribed resources to 1. This will leave the agent host overcommitted 
> for some time, because the tasks can still use 2 oversubscribed resources 
> rather than 1; once the tasks using the 2 oversubscribed resources finish, 
> everything goes back to normal.
> So here we should adjust the order of rescind offer and updateSlave in the 
> master to avoid resource overcommitment.
> If we update the slave first and then rescind offers, the order will be update 
> slave -> batch allocation -> rescind offer. This order has no problem when 
> decreasing resources. Suppose the oversubscribed resources are decreased from 
> 2 to 1: update slave will update the total oversubscribed resources to 1 
> directly, then the batch allocation will not allocate any oversubscribed 
> resources since more is already allocated than the total oversubscribed 
> resources, and then rescind offer will rescind all offers using oversubscribed 
> resources. This will not lead the agent host to be overcommitted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5139) ProvisionerDockerLocalStoreTest.LocalStoreTestWithTar is flaky

2016-10-07 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-5139:

Assignee: (was: Gilbert Song)

> ProvisionerDockerLocalStoreTest.LocalStoreTestWithTar is flaky
> --
>
> Key: MESOS-5139
> URL: https://issues.apache.org/jira/browse/MESOS-5139
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.28.0
> Environment: Ubuntu14.04
>Reporter: Vinod Kone
>  Labels: mesosphere
>
> Found this on ASF CI while testing 0.28.1-rc2
> {code}
> [ RUN  ] ProvisionerDockerLocalStoreTest.LocalStoreTestWithTar
> E0406 18:29:30.870481   520 shell.hpp:93] Command 'hadoop version 2>&1' 
> failed; this is the output:
> sh: 1: hadoop: not found
> E0406 18:29:30.870576   520 fetcher.cpp:59] Failed to create URI fetcher 
> plugin 'hadoop': Failed to create HDFS client: Failed to execute 'hadoop 
> version 2>&1'; the command was either not found or exited with a non-zero 
> exit status: 127
> I0406 18:29:30.871052   520 local_puller.cpp:90] Creating local puller with 
> docker registry '/tmp/3l8ZBv/images'
> I0406 18:29:30.873325   539 metadata_manager.cpp:159] Looking for image 'abc'
> I0406 18:29:30.874438   539 local_puller.cpp:142] Untarring image 'abc' from 
> '/tmp/3l8ZBv/images/abc.tar' to '/tmp/3l8ZBv/store/staging/5tw8bD'
> I0406 18:29:30.901916   547 local_puller.cpp:162] The repositories JSON file 
> for image 'abc' is '{"abc":{"latest":"456"}}'
> I0406 18:29:30.902304   547 local_puller.cpp:290] Extracting layer tar ball 
> '/tmp/3l8ZBv/store/staging/5tw8bD/123/layer.tar to rootfs 
> '/tmp/3l8ZBv/store/staging/5tw8bD/123/rootfs'
> I0406 18:29:30.909144   547 local_puller.cpp:290] Extracting layer tar ball 
> '/tmp/3l8ZBv/store/staging/5tw8bD/456/layer.tar to rootfs 
> '/tmp/3l8ZBv/store/staging/5tw8bD/456/rootfs'
> ../../src/tests/containerizer/provisioner_docker_tests.cpp:183: Failure
> (imageInfo).failure(): Collect failed: Subprocess 'tar, tar, -x, -f, 
> /tmp/3l8ZBv/store/staging/5tw8bD/456/layer.tar, -C, 
> /tmp/3l8ZBv/store/staging/5tw8bD/456/rootfs' failed: tar: This does not look 
> like a tar archive
> tar: Exiting with failure status due to previous errors
> [  FAILED  ] ProvisionerDockerLocalStoreTest.LocalStoreTestWithTar (243 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6207) Python bindings fail to build with custom SVN installation path

2016-10-07 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1969#comment-1969
 ] 

Ilya Pronin commented on MESOS-6207:


Thanks! Strange, on my RB profile page all three fields (first / last name and 
email) are filled in. But the "Keep profile information private" checkbox was 
checked. Could that cause the problem?

> Python bindings fail to build with custom SVN installation path
> ---
>
> Key: MESOS-6207
> URL: https://issues.apache.org/jira/browse/MESOS-6207
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.0.1
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Trivial
> Fix For: 1.1.0
>
>
> In {{src/Makefile.am}}, the {{PYTHON_LDFLAGS}} variable is used while building 
> the Python bindings. This variable picks up {{LDFLAGS}} during the 
> configuration phase, before we check for a custom SVN installation path, and 
> so misses the {{-L$\{with_svn\}/lib}} flag. That causes a link error on 
> systems with an uncommon SVN installation path.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6250) Ensure valid task state before connecting with framework on master failover

2016-10-07 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1947#comment-1947
 ] 

Joseph Wu commented on MESOS-6250:
--

This, along with a variety of other partition scenarios, is tracked in this 
epic:
https://issues.apache.org/jira/browse/MESOS-5344

> Ensure valid task state before connecting with framework on master failover
> ---
>
> Key: MESOS-6250
> URL: https://issues.apache.org/jira/browse/MESOS-6250
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.28.0, 0.28.1, 1.0.1
> Environment: OS X 10.11.6
>Reporter: Markus Jura
>Priority: Minor
>
> During a Mesos master failover the master re-registers with its slaves to 
> receive the current state of the running tasks. It also reconnects to a 
> framework.
> In the documentation it is recommended that a framework performs an explicit 
> task reconciliation when the Mesos master re-registers: 
> http://mesos.apache.org/documentation/latest/reconciliation/
> When allowing a reconciliation by a framework, the Mesos master should 
> guarantee that its task state is valid, i.e. the same as on the slaves. 
> Otherwise, Mesos can reply with status updates of state {{TASK_LOST}} even if 
> the task is still running on the slave.
> Now, on Mesos master failover, Mesos does not guarantee that it first 
> re-registers with its slaves before it reconnects to a framework. So it can 
> occur that the framework connects before Mesos has finished or started the 
> re-registration with the slaves. When the framework then sends reconciliation 
> requests directly after a re-registration, Mesos will reply with status 
> updates where the task state is wrong ({{TASK_LOST}} instead of 
> {{TASK_RUNNING}}).
> For a reconciliation request, Mesos should guarantee that the task state is 
> consistent with the slaves before it replies with a status update.
> Another possibility would be that Mesos sends a message to the framework once 
> it has re-registered with the slaves so that the framework then starts the 
> reconciliation. So far a framework can only delay the reconciliation for a 
> certain amount of time. But it does not know how long the delay should be 
> because Mesos is not notifying the framework when the task state is 
> consistent again. 
> *Log: Mesos master - connecting with framework before re-registering with 
> slaves*
> {code:bash}
> I0926 12:39:42.006933 4284416 detector.cpp:152] Detected a new leader: 
> (id='92')
> I0926 12:39:42.007242 1064960 group.cpp:706] Trying to get 
> '/mesos/json.info_92' in ZooKeeper
> I0926 12:39:42.008129 4284416 zookeeper.cpp:259] A new leading master 
> (UPID=master@127.0.0.1:5049) is detected
> I0926 12:39:42.008304 4284416 master.cpp:1847] The newly elected leader is 
> master@127.0.0.1:5049 with id 96178e81-8371-48af-ba5e-c79d16c27fab
> I0926 12:39:42.008332 4284416 master.cpp:1860] Elected as the leading master!
> I0926 12:39:42.008349 4284416 master.cpp:1547] Recovering from registrar
> I0926 12:39:42.008488 3211264 registrar.cpp:332] Recovering registrar
> I0926 12:39:42.015935 4284416 registrar.cpp:365] Successfully fetched the 
> registry (0B) in 7.426816ms
> I0926 12:39:42.015985 4284416 registrar.cpp:464] Applied 1 operations in 
> 11us; attempting to update the 'registry'
> I0926 12:39:42.021425 4284416 registrar.cpp:509] Successfully updated the 
> 'registry' in 5.426176ms
> I0926 12:39:42.021462 4284416 registrar.cpp:395] Successfully recovered 
> registrar
> I0926 12:39:42.021581 528384 master.cpp:1655] Recovered 0 agents from the 
> Registry (118B) ; allowing 10mins for agents to re-register
> I0926 12:39:42.299598 3747840 master.cpp:2424] Received SUBSCRIBE call for 
> framework 'conductr' at 
> scheduler-65610031-d679-49e5-b7bd-6068500d4674@192.168.2.106:65290
> I0926 12:39:42.299697 3747840 master.cpp:2500] Subscribing framework conductr 
> with checkpointing disabled and capabilities [  ]
> I0926 12:39:42.300122 2674688 hierarchical.cpp:271] Added framework conductr
> I0926 12:39:42.954983 1601536 master.cpp:4787] Re-registering agent 
> b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at slave(1)@127.0.0.1:5051 (127.0.0.1)
> I0926 12:39:42.955189 1064960 registrar.cpp:464] Applied 1 operations in 
> 60us; attempting to update the 'registry'
> I0926 12:39:42.955893 1064960 registrar.cpp:509] Successfully updated the 
> 'registry' in 649984ns
> I0926 12:39:42.956224 4284416 master.cpp:7447] Adding task 
> c69df81e-35f4-4c2e-863b-4e9d5ae2e850 with resources mem(*):0 on agent 
> b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 (127.0.0.1)
> I0926 12:39:42.956704 4284416 master.cpp:4872] Re-registered agent 
> b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at slave(1)@127.0.0.1:5051 
> (127.0.0.1) with cpus(*):8; mem(*):15360; disk(*):470832; 
> 

[jira] [Created] (MESOS-6338) Support --revocable_cpu_low_priority flag for docker containerizer

2016-10-07 Thread Kunal Thakar (JIRA)
Kunal Thakar created MESOS-6338:
---

 Summary: Support --revocable_cpu_low_priority flag for docker 
containerizer
 Key: MESOS-6338
 URL: https://issues.apache.org/jira/browse/MESOS-6338
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Kunal Thakar


The Mesos containerizer supports setting lower CPU shares for revocable tasks by 
passing --revocable_cpu_low_priority to the Mesos agent. This flag is only 
supported by the Mesos containerizer, but I don't see a reason why the behavior 
can't be replicated for the Docker containerizer. 

When the flag is set, the CPU shares assigned to revocable tasks are lower than 
those of normal tasks 
(https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/isolators/cgroups/subsystems/cpu.cpp#L83).
 This does not happen in the Docker containerizer 
(https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1517),
 but it could easily be replicated there, as sketched below. 
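
For illustration, here is a minimal, self-contained sketch of the idea. The 
share-per-CPU constants are assumptions modeled on the Mesos containerizer's 
cgroups CPU subsystem linked above, not a verbatim copy; the Docker 
containerizer would feed the same result into the CPU shares it passes to 
`docker run`:

{code}
#include <algorithm>
#include <cstdint>
#include <iostream>

// Assumed constants, modeled on the cgroups CPU subsystem linked above.
const uint64_t CPU_SHARES_PER_CPU = 1024;
const uint64_t CPU_SHARES_PER_CPU_REVOCABLE = 10;
const uint64_t MIN_CPU_SHARES = 2;

// Compute the cgroup 'cpu.shares' value for a task: when the agent runs with
// --revocable_cpu_low_priority and the task uses revocable CPUs, a much lower
// shares-per-CPU multiplier is used.
uint64_t cpuShares(double cpus, bool revocable, bool revocableCpuLowPriority)
{
  const uint64_t perCpu = (revocable && revocableCpuLowPriority)
    ? CPU_SHARES_PER_CPU_REVOCABLE
    : CPU_SHARES_PER_CPU;

  return std::max(static_cast<uint64_t>(perCpu * cpus), MIN_CPU_SHARES);
}

int main()
{
  std::cout << cpuShares(2.0, false, true) << std::endl; // 2048 (normal task)
  std::cout << cpuShares(2.0, true, true) << std::endl;  // 20 (revocable task)
  return 0;
}
{code}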

I can send a patch if this is acceptable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2723) The mesos-execute tool does not support zk:// master URLs

2016-10-07 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1894#comment-1894
 ] 

Joseph Wu commented on MESOS-2723:
--

The existing review is not quite correct (and has been discarded due to 
inactivity).

The fix should be to:
1) Make {{--master}} a required flag (i.e. change {{Option<string> master}} 
to {{string master}}).
2) Remove all custom (unnecessary) validation of {{flags.master}}.
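
A rough sketch of what (1) could look like, using the usual stout 
{{FlagsBase}} pattern (illustrative only, not the actual {{execute.cpp}} code):

{code}
#include <string>

#include <stout/flags.hpp>

// Sketch only: declaring `master` as a plain std::string (instead of
// Option<std::string>) with no default value makes flag loading itself fail
// when --master is missing, so the custom PID validation can be removed.
class Flags : public virtual flags::FlagsBase
{
public:
  Flags()
  {
    add(&Flags::master,
        "master",
        "Mesos master (e.g., IP:PORT, zk://host:port/path, file:///path)");
  }

  std::string master;
};
{code}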

> The mesos-execute tool does not support zk:// master URLs
> -
>
> Key: MESOS-2723
> URL: https://issues.apache.org/jira/browse/MESOS-2723
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.22.1
>Reporter: Tom Arnfeld
>  Labels: newbie
>
> It appears that the {{mesos-execute}} command line tool does its own PID 
> validation of the {{--master}} param, which prevents it from supporting 
> clusters managed with ZooKeeper.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2723) The mesos-execute tool does not support zk:// master URLs

2016-10-07 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-2723:
-
Assignee: (was: Tom Arnfeld)
Story Points: 1
  Labels: newbie  (was: )

> The mesos-execute tool does not support zk:// master URLs
> -
>
> Key: MESOS-2723
> URL: https://issues.apache.org/jira/browse/MESOS-2723
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.22.1
>Reporter: Tom Arnfeld
>  Labels: newbie
>
> It appears that the {{mesos-execute}} command line tool does its own PID 
> validation of the {{--master}} param, which prevents it from supporting 
> clusters managed with ZooKeeper.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6337) Nested containers getting killed before network isolation can be applied to them.

2016-10-07 Thread Avinash Sridharan (JIRA)
Avinash Sridharan created MESOS-6337:


 Summary: Nested containers getting killed before network isolation 
can be applied to them.
 Key: MESOS-6337
 URL: https://issues.apache.org/jira/browse/MESOS-6337
 Project: Mesos
  Issue Type: Bug
  Components: containerization
 Environment: Linux
Reporter: Avinash Sridharan
Assignee: Gilbert Song
 Fix For: 1.1.0


Seeing this odd behavior in one of our clusters:
```
http.cpp:1948] Failed to launch nested container 
cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: 
Collect failed: Failed to seed container 
cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: 
Collect failed: Failed to setup hostname and network files: Failed to enter the 
mount namespace of pid 21591: Pid 21591 does not exist
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.894485 31531 
containerizer.cpp:1931] Destroying container 
cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e in 
ISOLATING state
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.894439 31531 
containerizer.cpp:2300] Container 
cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e has 
exited
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.854456 31534 
systemd.cpp:96] Assigned child process '21591' to 'mesos_executors.slice'
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.831861 21580 
process.cpp:882] Failed SSL connections will be downgraded to a non-SSL socket
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set 
LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate verification
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831526 21580 
openssl.cpp:432] Will only verify peer certificate if presented!
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set 
LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831521 21580 
openssl.cpp:426] Will not verify peer certificate!
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831511 21580 
openssl.cpp:421] CA directory path unspecified! NOTE: Set CA directory path 
with LIBPROCESS_SSL_CA_DIR=
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.831405 21580 
openssl.cpp:399] Failed SSL connections will be downgraded to a non-SSL socket
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: WARNING: Logging before 
InitGoogleLogging() is written to STDERR
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.828413 21581 
process.cpp:882] Failed SSL connections will be downgraded to a non-SSL socket
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set 
LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate verification
```
The above log is in reverse chronological order, so please read it bottom-up.

The relevant log is:
```
http.cpp:1948] Failed to launch nested container 
cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: 
Collect failed: Failed to seed container 
cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: 
Collect failed: Failed to setup hostname and network files: Failed to enter the 
mount namespace of pid 21591: Pid 21591 does not exist
```
Looks like the nested container failed to launch because the `isolate` call to 
the `network/cni` isolator failed. It seems that by the time the isolator received 
the `isolate` call, the PID of the nested container had already exited, so the 
isolator couldn't enter its mount namespace to set up the network files. 

The odd thing here is that the nested container would have been frozen, and 
hence not running, so it is not clear what killed the nested container. My 
suspicion falls on systemd, since I also see this log message:
```
Oct 07 18:02:31 ip-10-10-0-207 mesos-agent[31520]: I1007 18:02:31.473656 31532 
systemd.cpp:96] Assigned child process '1596' to 'mesos_executors.slice'
```





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6142) Frameworks may RESERVE for an arbitrary role.

2016-10-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MESOS-6142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1785#comment-1785
 ] 

Gastón Kleiman commented on MESOS-6142:
---

Patch: https://reviews.apache.org/r/52642/

> Frameworks may RESERVE for an arbitrary role.
> -
>
> Key: MESOS-6142
> URL: https://issues.apache.org/jira/browse/MESOS-6142
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, master
>Affects Versions: 1.0.0
>Reporter: Alexander Rukletsov
>Assignee: Gastón Kleiman
>Priority: Blocker
>  Labels: mesosphere, reservations
>
> The master does not validate that resources from a reservation request have 
> the same role the framework is registered with. As a result, frameworks may 
> reserve resources for arbitrary roles.
> I've modified the role in [the {{ReserveThenUnreserve}} 
> test|https://github.com/apache/mesos/blob/bca600cf5602ed8227d91af9f73d689da14ad786/src/tests/reservation_tests.cpp#L117]
>  to "yoyo" and observed the following in the test's log:
> {noformat}
> I0908 18:35:43.379122 2138112 master.cpp:3362] Processing ACCEPT call for 
> offers: [ dfaf67e6-7c1c-4988-b427-c49842cb7bb7-O0 ] on agent 
> dfaf67e6-7c1c-4988-b427-c49842cb7bb7-S0 at slave(1)@10.200.181.237:60116 
> (alexr.railnet.train) for framework dfaf67e6-7c1c-4988-b427-c49842cb7bb7- 
> (default) at 
> scheduler-ca12a660-9f08-49de-be4e-d452aa3aa6da@10.200.181.237:60116
> I0908 18:35:43.379170 2138112 master.cpp:3022] Authorizing principal 
> 'test-principal' to reserve resources 'cpus(yoyo, test-principal):1; 
> mem(yoyo, test-principal):512'
> I0908 18:35:43.379678 2138112 master.cpp:3642] Applying RESERVE operation for 
> resources cpus(yoyo, test-principal):1; mem(yoyo, test-principal):512 from 
> framework dfaf67e6-7c1c-4988-b427-c49842cb7bb7- (default) at 
> scheduler-ca12a660-9f08-49de-be4e-d452aa3aa6da@10.200.181.237:60116 to agent 
> dfaf67e6-7c1c-4988-b427-c49842cb7bb7-S0 at slave(1)@10.200.181.237:60116 
> (alexr.railnet.train)
> I0908 18:35:43.379767 2138112 master.cpp:7341] Sending checkpointed resources 
> cpus(yoyo, test-principal):1; mem(yoyo, test-principal):512 to agent 
> dfaf67e6-7c1c-4988-b427-c49842cb7bb7-S0 at slave(1)@10.200.181.237:60116 
> (alexr.railnet.train)
> I0908 18:35:43.380273 3211264 slave.cpp:2497] Updated checkpointed resources 
> from  to cpus(yoyo, test-principal):1; mem(yoyo, test-principal):512
> I0908 18:35:43.380574 2674688 hierarchical.cpp:760] Updated allocation of 
> framework dfaf67e6-7c1c-4988-b427-c49842cb7bb7- on agent 
> dfaf67e6-7c1c-4988-b427-c49842cb7bb7-S0 from cpus(*):1; mem(*):512; 
> disk(*):470841; ports(*):[31000-32000] to ports(*):[31000-32000]; cpus(yoyo, 
> test-principal):1; disk(*):470841; mem(yoyo, test-principal):512 with RESERVE 
> operation
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6335) Add user doc for task group tasks

2016-10-07 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6335:
--
Shepherd: Vinod Kone

> Add user doc for task group tasks
> -
>
> Key: MESOS-6335
> URL: https://issues.apache.org/jira/browse/MESOS-6335
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Vinod Kone
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-6336) SlaveTest.KillTaskGroupBetweenRunTaskParts is flaky

2016-10-07 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1621#comment-1621
 ] 

Greg Mann edited comment on MESOS-6336 at 10/7/16 4:57 PM:
---

Here's a partial log from the ASF CI as well, from 10 days ago. This one was 
CentOS 7:
{code}
I0927 06:49:21.610502 30001 http.cpp:883] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0927 06:49:21.610563 30003 recover.cpp:568] Updating replica status to VOTING
I0927 06:49:21.610743 30001 http.cpp:883] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0927 06:49:21.610916 30001 master.cpp:584] Authorization enabled
I0927 06:49:21.611145 30011 hierarchical.cpp:149] Initialized hierarchical 
allocator process
I0927 06:49:21.611171 30013 whitelist_watcher.cpp:77] No whitelist given
I0927 06:49:21.611275 30009 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 414250ns
I0927 06:49:21.611301 30009 replica.cpp:320] Persisted replica status to VOTING
I0927 06:49:21.611450 30008 recover.cpp:582] Successfully joined the Paxos group
I0927 06:49:21.611651 30008 recover.cpp:466] Recover process terminated
I0927 06:49:21.613910 30012 master.cpp:2013] Elected as the leading master!
I0927 06:49:21.613943 30012 master.cpp:1560] Recovering from registrar
I0927 06:49:21.614099 30013 registrar.cpp:329] Recovering registrar
I0927 06:49:21.614842 30012 log.cpp:553] Attempting to start the writer
I0927 06:49:21.616055 30014 replica.cpp:493] Replica received implicit promise 
request from __req_res__(6052)@172.17.0.2:49598 with proposal 1
I0927 06:49:21.616436 30014 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 345420ns
I0927 06:49:21.616459 30014 replica.cpp:342] Persisted promised to 1
I0927 06:49:21.616914 30006 coordinator.cpp:238] Coordinator attempting to fill 
missing positions
I0927 06:49:21.618098 30006 replica.cpp:388] Replica received explicit promise 
request from __req_res__(6053)@172.17.0.2:49598 for position 0 with proposal 2
I0927 06:49:21.618446 30006 leveldb.cpp:341] Persisting action (8 bytes) to 
leveldb took 305036ns
I0927 06:49:21.618474 30006 replica.cpp:708] Persisted action NOP at position 0
I0927 06:49:21.619513 30012 replica.cpp:537] Replica received write request for 
position 0 from __req_res__(6054)@172.17.0.2:49598
I0927 06:49:21.619604 30012 leveldb.cpp:436] Reading position from leveldb took 
55504ns
I0927 06:49:21.619915 30012 leveldb.cpp:341] Persisting action (14 bytes) to 
leveldb took 262919ns
I0927 06:49:21.619941 30012 replica.cpp:708] Persisted action NOP at position 0
I0927 06:49:21.620503 30016 replica.cpp:691] Replica received learned notice 
for position 0 from @0.0.0.0:0
I0927 06:49:21.620851 30016 leveldb.cpp:341] Persisting action (16 bytes) to 
leveldb took 313765ns
I0927 06:49:21.620878 30016 replica.cpp:708] Persisted action NOP at position 0
I0927 06:49:21.621417 30014 log.cpp:569] Writer started with ending position 0
I0927 06:49:21.622566 30013 leveldb.cpp:436] Reading position from leveldb took 
28375ns
I0927 06:49:21.623528 30005 registrar.cpp:362] Successfully fetched the 
registry (0B) in 9.373952ms
I0927 06:49:21.623668 30005 registrar.cpp:461] Applied 1 operations in 25023ns; 
attempting to update the registry
I0927 06:49:21.624490 30012 log.cpp:577] Attempting to append 168 bytes to the 
log
I0927 06:49:21.624620 30004 coordinator.cpp:348] Coordinator attempting to 
write APPEND action at position 1
I0927 06:49:21.625282 30007 replica.cpp:537] Replica received write request for 
position 1 from __req_res__(6055)@172.17.0.2:49598
I0927 06:49:21.625720 30007 leveldb.cpp:341] Persisting action (187 bytes) to 
leveldb took 396032ns
I0927 06:49:21.625746 30007 replica.cpp:708] Persisted action APPEND at 
position 1
I0927 06:49:21.626509 30012 replica.cpp:691] Replica received learned notice 
for position 1 from @0.0.0.0:0
I0927 06:49:21.626986 30012 leveldb.cpp:341] Persisting action (189 bytes) to 
leveldb took 328126ns
I0927 06:49:21.627027 30012 replica.cpp:708] Persisted action APPEND at 
position 1
I0927 06:49:21.628249 30014 registrar.cpp:506] Successfully updated the 
registry in 4.504832ms
I0927 06:49:21.628463 30016 log.cpp:596] Attempting to truncate the log to 1
I0927 06:49:21.628484 30014 registrar.cpp:392] Successfully recovered registrar
I0927 06:49:21.628619 30005 coordinator.cpp:348] Coordinator attempting to 
write TRUNCATE action at position 2
I0927 06:49:21.629341 30010 master.cpp:1676] Recovered 0 agents from the 
registry (129B); allowing 10mins for agents to re-register
I0927 06:49:21.629361 30007 hierarchical.cpp:176] Skipping recovery of 
hierarchical allocator: nothing to recover
I0927 06:49:21.629873 30004 replica.cpp:537] Replica received write request for 
position 2 from __req_res__(6056)@172.17.0.2:49598
I0927 06:49:21.630329 30004 leveldb.cpp:341] Persisting action (16 bytes) to 

[jira] [Commented] (MESOS-6336) SlaveTest.KillTaskGroupBetweenRunTaskParts is flaky

2016-10-07 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1621#comment-1621
 ] 

Greg Mann commented on MESOS-6336:
--

Here's a partial log from the ASF CI as well, from 10 days ago:
{code}
I0927 06:49:21.610502 30001 http.cpp:883] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0927 06:49:21.610563 30003 recover.cpp:568] Updating replica status to VOTING
I0927 06:49:21.610743 30001 http.cpp:883] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0927 06:49:21.610916 30001 master.cpp:584] Authorization enabled
I0927 06:49:21.611145 30011 hierarchical.cpp:149] Initialized hierarchical 
allocator process
I0927 06:49:21.611171 30013 whitelist_watcher.cpp:77] No whitelist given
I0927 06:49:21.611275 30009 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 414250ns
I0927 06:49:21.611301 30009 replica.cpp:320] Persisted replica status to VOTING
I0927 06:49:21.611450 30008 recover.cpp:582] Successfully joined the Paxos group
I0927 06:49:21.611651 30008 recover.cpp:466] Recover process terminated
I0927 06:49:21.613910 30012 master.cpp:2013] Elected as the leading master!
I0927 06:49:21.613943 30012 master.cpp:1560] Recovering from registrar
I0927 06:49:21.614099 30013 registrar.cpp:329] Recovering registrar
I0927 06:49:21.614842 30012 log.cpp:553] Attempting to start the writer
I0927 06:49:21.616055 30014 replica.cpp:493] Replica received implicit promise 
request from __req_res__(6052)@172.17.0.2:49598 with proposal 1
I0927 06:49:21.616436 30014 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 345420ns
I0927 06:49:21.616459 30014 replica.cpp:342] Persisted promised to 1
I0927 06:49:21.616914 30006 coordinator.cpp:238] Coordinator attempting to fill 
missing positions
I0927 06:49:21.618098 30006 replica.cpp:388] Replica received explicit promise 
request from __req_res__(6053)@172.17.0.2:49598 for position 0 with proposal 2
I0927 06:49:21.618446 30006 leveldb.cpp:341] Persisting action (8 bytes) to 
leveldb took 305036ns
I0927 06:49:21.618474 30006 replica.cpp:708] Persisted action NOP at position 0
I0927 06:49:21.619513 30012 replica.cpp:537] Replica received write request for 
position 0 from __req_res__(6054)@172.17.0.2:49598
I0927 06:49:21.619604 30012 leveldb.cpp:436] Reading position from leveldb took 
55504ns
I0927 06:49:21.619915 30012 leveldb.cpp:341] Persisting action (14 bytes) to 
leveldb took 262919ns
I0927 06:49:21.619941 30012 replica.cpp:708] Persisted action NOP at position 0
I0927 06:49:21.620503 30016 replica.cpp:691] Replica received learned notice 
for position 0 from @0.0.0.0:0
I0927 06:49:21.620851 30016 leveldb.cpp:341] Persisting action (16 bytes) to 
leveldb took 313765ns
I0927 06:49:21.620878 30016 replica.cpp:708] Persisted action NOP at position 0
I0927 06:49:21.621417 30014 log.cpp:569] Writer started with ending position 0
I0927 06:49:21.622566 30013 leveldb.cpp:436] Reading position from leveldb took 
28375ns
I0927 06:49:21.623528 30005 registrar.cpp:362] Successfully fetched the 
registry (0B) in 9.373952ms
I0927 06:49:21.623668 30005 registrar.cpp:461] Applied 1 operations in 25023ns; 
attempting to update the registry
I0927 06:49:21.624490 30012 log.cpp:577] Attempting to append 168 bytes to the 
log
I0927 06:49:21.624620 30004 coordinator.cpp:348] Coordinator attempting to 
write APPEND action at position 1
I0927 06:49:21.625282 30007 replica.cpp:537] Replica received write request for 
position 1 from __req_res__(6055)@172.17.0.2:49598
I0927 06:49:21.625720 30007 leveldb.cpp:341] Persisting action (187 bytes) to 
leveldb took 396032ns
I0927 06:49:21.625746 30007 replica.cpp:708] Persisted action APPEND at 
position 1
I0927 06:49:21.626509 30012 replica.cpp:691] Replica received learned notice 
for position 1 from @0.0.0.0:0
I0927 06:49:21.626986 30012 leveldb.cpp:341] Persisting action (189 bytes) to 
leveldb took 328126ns
I0927 06:49:21.627027 30012 replica.cpp:708] Persisted action APPEND at 
position 1
I0927 06:49:21.628249 30014 registrar.cpp:506] Successfully updated the 
registry in 4.504832ms
I0927 06:49:21.628463 30016 log.cpp:596] Attempting to truncate the log to 1
I0927 06:49:21.628484 30014 registrar.cpp:392] Successfully recovered registrar
I0927 06:49:21.628619 30005 coordinator.cpp:348] Coordinator attempting to 
write TRUNCATE action at position 2
I0927 06:49:21.629341 30010 master.cpp:1676] Recovered 0 agents from the 
registry (129B); allowing 10mins for agents to re-register
I0927 06:49:21.629361 30007 hierarchical.cpp:176] Skipping recovery of 
hierarchical allocator: nothing to recover
I0927 06:49:21.629873 30004 replica.cpp:537] Replica received write request for 
position 2 from __req_res__(6056)@172.17.0.2:49598
I0927 06:49:21.630329 30004 leveldb.cpp:341] Persisting action (16 bytes) to 
leveldb took 404029ns
I0927 06:49:21.630362 30004 replica.cpp:708] 

[jira] [Created] (MESOS-6336) SlaveTest.KillTaskGroupBetweenRunTaskParts is flaky

2016-10-07 Thread Greg Mann (JIRA)
Greg Mann created MESOS-6336:


 Summary: SlaveTest.KillTaskGroupBetweenRunTaskParts is flaky
 Key: MESOS-6336
 URL: https://issues.apache.org/jira/browse/MESOS-6336
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Greg Mann


The test {{SlaveTest.KillTaskGroupBetweenRunTaskParts}} sometimes segfaults 
during the agent's {{finalize()}} method. This was observed on our internal CI, 
on Fedora with libev, without SSL:
{code}
[ RUN  ] SlaveTest.KillTaskGroupBetweenRunTaskParts
I1007 14:12:57.973811 28630 cluster.cpp:158] Creating default 'local' authorizer
I1007 14:12:57.982128 28630 leveldb.cpp:174] Opened db in 8.195028ms
I1007 14:12:57.982599 28630 leveldb.cpp:181] Compacted db in 446238ns
I1007 14:12:57.982616 28630 leveldb.cpp:196] Created db iterator in 3650ns
I1007 14:12:57.982622 28630 leveldb.cpp:202] Seeked to beginning of db in 451ns
I1007 14:12:57.982627 28630 leveldb.cpp:271] Iterated through 0 keys in the db 
in 352ns
I1007 14:12:57.982638 28630 replica.cpp:776] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I1007 14:12:57.983024 28645 recover.cpp:451] Starting replica recovery
I1007 14:12:57.983127 28651 recover.cpp:477] Replica is in EMPTY status
I1007 14:12:57.983459 28644 replica.cpp:673] Replica in EMPTY status received a 
broadcasted recover request from __req_res__(6234)@172.30.2.161:38776
I1007 14:12:57.983543 28651 recover.cpp:197] Received a recover response from a 
replica in EMPTY status
I1007 14:12:57.983680 28650 recover.cpp:568] Updating replica status to STARTING
I1007 14:12:57.983990 28648 master.cpp:380] Master 
76d4d55f-dcc6-4033-85d9-7ec97ef353cb 
(ip-172-30-2-161.ec2.internal.mesosphere.io) started on 172.30.2.161:38776
I1007 14:12:57.984007 28648 master.cpp:382] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/rVbcaO/credentials" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--quiet="false" --recovery_agent_removal_limit="100%" 
--registry="replicated_log" --registry_fetch_timeout="1mins" 
--registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
--registry_max_agent_count="102400" --registry_store_timeout="100secs" 
--registry_strict="false" --root_submissions="true" --user_sorter="drf" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/rVbcaO/master" --zk_session_timeout="10secs"
I1007 14:12:57.984127 28648 master.cpp:432] Master only allowing authenticated 
frameworks to register
I1007 14:12:57.984134 28648 master.cpp:446] Master only allowing authenticated 
agents to register
I1007 14:12:57.984139 28648 master.cpp:459] Master only allowing authenticated 
HTTP frameworks to register
I1007 14:12:57.984143 28648 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/rVbcaO/credentials'
I1007 14:12:57.988487 28648 master.cpp:504] Using default 'crammd5' 
authenticator
I1007 14:12:57.988530 28648 http.cpp:883] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I1007 14:12:57.988585 28648 http.cpp:883] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I1007 14:12:57.988648 28648 http.cpp:883] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I1007 14:12:57.988680 28648 master.cpp:584] Authorization enabled
I1007 14:12:57.988757 28650 whitelist_watcher.cpp:77] No whitelist given
I1007 14:12:57.988772 28646 hierarchical.cpp:149] Initialized hierarchical 
allocator process
I1007 14:12:57.988917 28651 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 5.186917ms
I1007 14:12:57.988934 28651 replica.cpp:320] Persisted replica status to 
STARTING
I1007 14:12:57.989045 28651 recover.cpp:477] Replica is in STARTING status
I1007 14:12:57.989316 28648 master.cpp:2013] Elected as the leading master!
I1007 14:12:57.989331 28648 master.cpp:1560] Recovering from registrar
I1007 14:12:57.989408 28649 replica.cpp:673] Replica in STARTING status 
received a broadcasted recover request from __req_res__(6235)@172.30.2.161:38776
I1007 14:12:57.989423 28648 registrar.cpp:329] Recovering registrar
I1007 14:12:57.989792 28647 recover.cpp:197] Received a recover response from a 
replica in STARTING status
I1007 14:12:57.989956 28650 recover.cpp:568] 

[jira] [Updated] (MESOS-6322) Agent fails to kill empty parent container

2016-10-07 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6322:
--
  Sprint: Mesosphere Sprint 44
Story Points: 3

> Agent fails to kill empty parent container
> --
>
> Key: MESOS-6322
> URL: https://issues.apache.org/jira/browse/MESOS-6322
> Project: Mesos
>  Issue Type: Bug
>Reporter: Greg Mann
>Assignee: Anand Mazumdar
>Priority: Blocker
>  Labels: mesosphere
>
> I launched a pod using Marathon, which led to the launching of a task group 
> on a Mesos agent. The pod spec was flawed, so this led to Marathon repeatedly 
> re-launching multiple instances of the task group. After this went on for a 
> few minutes, I told Marathon to scale the app to 0 instances, which sends 
> {{TASK_KILLED}} for one task in each task group. After this, the Mesos agent 
> reports {{TASK_KILLED}} status updates for all 3 tasks in the pod, but 
> hitting the {{/containers}} endpoint on the agent reveals that the executor 
> container for this task group is still running.
> Here is the task group launching on the agent:
> {code}
> slave.cpp:1696] Launching task group containing tasks [ 
> test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1, 
> test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask2, 
> test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.clientTask ] for 
> framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-
> {code}
> and here is the executor container starting:
> {code}
> mesos-agent[2994]: I1006 20:23:27.407563  3094 containerizer.cpp:965] 
> Starting container bf38ff09-3da1-487a-8926-1f4cc88bce32 for executor 
> 'instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601' of framework 
> 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-
> {code}
> and here is the output showing the {{TASK_KILLED}} updates for one task group:
> {code}
> mesos-agent[2994]: I1006 20:23:28.728224  3097 slave.cpp:2283] Asked to kill 
> task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of 
> framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-
> mesos-agent[2994]: W1006 20:23:28.728304  3097 slave.cpp:2364] Transitioning 
> the state of task 
> test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of 
> framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- to TASK_KILLED because 
> the executor is not registered
> mesos-agent[2994]: I1006 20:23:28.728659  3097 slave.cpp:3609] Handling 
> status update TASK_KILLED (UUID: 1cb8197a-7829-4a05-9cb1-14ba97519c42) for 
> task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of 
> framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- from @0.0.0.0:0
> mesos-agent[2994]: I1006 20:23:28.728817  3097 slave.cpp:3609] Handling 
> status update TASK_KILLED (UUID: e377e9fb-6466-4ce5-b32a-37d840b9f87c) for 
> task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask2 of 
> framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- from @0.0.0.0:0
> mesos-agent[2994]: I1006 20:23:28.728912  3097 slave.cpp:3609] Handling 
> status update TASK_KILLED (UUID: 24d44b25-ea52-43a1-afdb-6c04389879d2) for 
> task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.clientTask of 
> framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- from @0.0.0.0:0
> {code}
> however, if we grep the log for the executor's ID, the last line mentioning 
> it is:
> {code}
> slave.cpp:3080] Creating a marker file for HTTP based executor 
> 'instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601' of framework 
> 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- (via HTTP) at path 
> '/var/lib/mesos/slave/meta/slaves/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-S0/frameworks/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-/executors/instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601/runs/bf38ff09-3da1-487a-8926-1f4cc88bce32/http.marker'
> {code}
> so it seems the executor never exited. If we hit the agent's {{/containers}} 
> endpoint, we get output which includes this executor container:
> {code}
> {
> "container_id": "bf38ff09-3da1-487a-8926-1f4cc88bce32",
> "executor_id": "instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601",
> "executor_name": "",
> "framework_id": "42838ca8-8d6b-475b-9b3b-59f3cd0d6834-",
> "source": "",
> "statistics": {
>   "cpus_limit": 0.1,
>   "cpus_nr_periods": 17,
>   "cpus_nr_throttled": 11,
>   "cpus_system_time_secs": 0.02,
>   "cpus_throttled_time_secs": 0.784142448,
>   "cpus_user_time_secs": 0.09,
>   "disk_limit_bytes": 10485760,
>   "disk_used_bytes": 20480,
>   "mem_anon_bytes": 11337728,
>   "mem_cache_bytes": 0,
>   "mem_critical_pressure_counter": 0,
>   "mem_file_bytes": 0,
>   "mem_limit_bytes": 33554432,
>   "mem_low_pressure_counter": 0,
>   

[jira] [Commented] (MESOS-5275) Add capabilities support for unified containerizer.

2016-10-07 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1543#comment-1543
 ] 

Jie Yu commented on MESOS-5275:
---

commit 4ea9651aabd01f623f2578d2823271488d924c5b
Author: Benjamin Bannier 
Date:   Wed Oct 5 21:44:04 2016 -0700

Created an isolator for Linux capabilities.

Review: https://reviews.apache.org/r/50271/

commit f6a25360053fc38e843129cc7e1f9fe4cf757ecd
Author: Benjamin Bannier 
Date:   Wed Oct 5 21:35:40 2016 -0700

Reorganized includes in containerizer.

Review: https://reviews.apache.org/r/52081/

commit e7d1f53621a09da47ee7dc5d6fcd6326cb72792d
Author: Benjamin Bannier 
Date:   Wed Oct 5 21:28:12 2016 -0700

Added `ping` to test linux rootfs.

Review: https://reviews.apache.org/r/51931/

commit 5e3648c871f8008d8e11390b2ccba86c59d82f70
Author: Benjamin Bannier 
Date:   Wed Oct 5 20:55:42 2016 -0700

Introduced Linux capabilities support for Mesos executor.

This change introduces Linux capability-based security to the Mesos
executor. A new flag `capabilities` is introduced to optionally specify
the capabilities tasks launched by the Mesos executor are allowed to
use.

Review: https://reviews.apache.org/r/51930/

> Add capabilities support for unified containerizer.
> ---
>
> Key: MESOS-5275
> URL: https://issues.apache.org/jira/browse/MESOS-5275
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Jojy Varghese
>Assignee: Benjamin Bannier
>  Labels: mesosphere
> Fix For: 1.1.0
>
>
> Add capabilities support for unified containerizer. 
> Requirements:
> 1. Use the mesos capabilities API.
> 2. Frameworks be able to add capability requests for containers.
> 3. Agents be able to add maximum allowed capabilities for all containers 
> launched.
> Design document: 
> https://docs.google.com/document/d/1YiTift8TQla2vq3upQr7K-riQ_pQ-FKOCOsysQJROGc/edit#heading=h.rgfwelqrskmd



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2016-10-07 Thread Megha (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1500#comment-1500
 ] 

Megha commented on MESOS-6223:
--

This JIRA came out as a prerequisite for supporting task restart after a reboot. 
There are definitely use cases where you would need a persistent agent ID, 
because resources like persistent volumes are not tied to the lifecycle of the 
ephemeral agent and exist even after the agent is gone. However, in order to 
support task restart on the rebooted host, we need the previous agent ID or 
session ID (from MESOS-5368) to figure out which tasks to restart and eventually 
restart them. So I believe agent or session recovery after a reboot is needed. 
Recovery being short-circuited after a reboot is an optimization based on the 
fact that no tasks/executors are running after the agent's host reboots, which 
will change with MESOS-3545.

> Allow agents to re-register post a host reboot
> --
>
> Key: MESOS-6223
> URL: https://issues.apache.org/jira/browse/MESOS-6223
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Reporter: Megha
>
> Agent does’t recover its state post a host reboot, it registers with the 
> master and gets a new SlaveID. With partition awareness, the agents are now 
> allowed to re-register after they have been marked Unreachable. The executors 
> are anyway terminated on the agent when it reboots so there is no harm in 
> letting the agent keep its SlaveID, re-register with the master and reconcile 
> the lost executors. This is a pre-requisite for supporting 
> persistent/restartable tasks in mesos (MESOS-3545).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6327) Large docker images make the mesos containerizer crash with: Too many levels of symbolic links

2016-10-07 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1358#comment-1358
 ] 

Gilbert Song commented on MESOS-6327:
-

[~a-nldisr] Thanks for reporting this issue. Currently, Mesos selects the 
`copy` backend for the unified containerizer by default. However, for better 
performance with large images (or images with many layers), we recommend using 
the `overlay` backend, or `aufs`. We are considering supporting automatic backend 
selection by default in MESOS-5931.

We need to fix this issue in the copy backend. Could you please test whether you 
are still blocked when using the `overlay` backend? Hopefully that resolves your 
issue. 
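
For example, relative to the flags in the report, only the backend flag needs 
to change (a sketch, keeping everything else as reported):

{code}
--containerizers=docker,mesos
--image_providers=appc,docker
--image_provisioner_backend=overlay
--isolation=filesystem/linux,docker/runtime
{code}

Note that the `overlay` backend requires overlayfs support in the running 
kernel; `aufs` is the analogous choice on kernels that ship aufs instead.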

> Large docker images make the mesos containerizer crash with: Too many levels 
> of symbolic links
> --
>
> Key: MESOS-6327
> URL: https://issues.apache.org/jira/browse/MESOS-6327
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.0, 1.0.1
> Environment: centos 7.2 (1511), ubuntu 14.04 (trusty). Replicated in 
> the Apache Aurora vagrant image
>Reporter: Rogier Dikkes
>Priority: Critical
>
> When deploying Mesos containers with large (6G+, 60+ layers) Docker images 
> the task crashes with the error: 
> Mesos agent logs: 
> E1007 08:40:12.954227  8117 slave.cpp:3976] Container 
> 'a1d759ae-5bc6-4c4e-ac03-717fbb8e5da4' for executor 
> 'thermos-www-data-devel-hello_docker_image-0-d42d2af6-6b44-4b2b-be95-e1ba93a6b365'
>  of framework df
> c91a86-84b9-4539-a7be-4ace7b7b44a1- failed to start: Collect failed: 
> Collect failed: Failed to copy layer: cp: cannot stat 
> ‘/var/lib/mesos/provisioner/containers/a1d759ae-5bc6-4c4e-ac03-717fbb8e5da4/b
> ackends/copy/rootfses/5f328f72-25d4-4a26-ac83-8d30bbc44e97/usr/share/zoneinfo/right/Asia/Urumqi’:
>  Too many levels of symbolic links
> ... (complete pastebin: http://pastebin.com/umZ4Q5d1 )
> How to replicate:
> Start the aurora vagrant image. Adjust the 
> /etc/mesos-slave/executor_registration_timeout to 5 mins. Adjust the file 
> /vagrant/examples/jobs/hello_docker_image.aurora to start a large Docker 
> image instead of the example (you can use anldisr/jupyter:0.4, a test image I 
> created based on the Jupyter notebook stacks). Create the job and 
> watch it fail after a number of minutes. 
> The mesos sandbox is empty. 
> Aurora errors I see: 
> 28 minutes ago - FAILED : Failed to launch container: Collect failed: Collect 
> failed: Failed to copy layer: cp: cannot stat 
> ‘/var/lib/mesos/provisioner/containers/93420a36-0e0c-4f04-b401-74c426c25686/backends/copy/rootfses/6e185a51-7174-4b0d-a305-42b634eb91bb/usr/share/zoneinfo/right/Asia/Urumqi’:
>  Too many levels of symbolic links cp: cannot stat ... 
> Too many levels of symbolic links ; Container destroyed while provisioning 
> images
> (complete pastebin: http://pastebin.com/uecHYD5J )
> To rule out the image itself, I started this and other images as normal Docker 
> containers. This works without issues. 
> Mesos flags related configured: 
> -appc_store_dir 
> /tmp/mesos/images/appc
> -containerizers 
> docker,mesos
> -executor_registration_timeout 
> 5mins
> -image_providers 
> appc,docker
> -image_provisioner_backend 
> copy
> -isolation 
> filesystem/linux,docker/runtime
> Affected Mesos versions tested: 1.0.1 & 1.0.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6335) Add user doc for task group tasks

2016-10-07 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-6335:
-

 Summary: Add user doc for task group tasks
 Key: MESOS-6335
 URL: https://issues.apache.org/jira/browse/MESOS-6335
 Project: Mesos
  Issue Type: Documentation
Reporter: Vinod Kone






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6333) Don't send TASK_LOST when removing a framework from an agent

2016-10-07 Thread Neil Conway (JIRA)
Neil Conway created MESOS-6333:
--

 Summary: Don't send TASK_LOST when removing a framework from an 
agent
 Key: MESOS-6333
 URL: https://issues.apache.org/jira/browse/MESOS-6333
 Project: Mesos
  Issue Type: Improvement
  Components: master
Reporter: Neil Conway
Assignee: Neil Conway


Update this code:

{code}
  // Remove pointers to framework's tasks in slaves, and send status
  // updates.
  // NOTE: A copy is needed because removeTask modifies slave->tasks.
  foreachvalue (Task* task, utils::copy(slave->tasks[framework->id()])) {
// Remove tasks that belong to this framework.
if (task->framework_id() == framework->id()) {
  // A framework might not actually exist because the master failed
  // over and the framework hasn't reconnected yet. For more info
  // please see the comments in 'removeFramework(Framework*)'.
  const StatusUpdate& update = protobuf::createStatusUpdate(
task->framework_id(),
task->slave_id(),
task->task_id(),
TASK_LOST,
TaskStatus::SOURCE_MASTER,
None(),
"Slave " + slave->info.hostname() + " disconnected",
TaskStatus::REASON_SLAVE_DISCONNECTED,
(task->has_executor_id()
? Option<ExecutorID>(task->executor_id()) : None()));

  updateTask(task, update);
  removeTask(task);
  forward(update, UPID(), framework);
}
  }
{code}
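
A rough sketch of the direction, assuming a hypothetical 
{{isPartitionAware(framework)}} helper for the PARTITION_AWARE capability check 
and using {{TASK_DROPPED}} only as a placeholder (the exact replacement state 
for this code path is still to be decided):

{code}
// Sketch only: keep TASK_LOST for old frameworks and switch to a
// partition-aware state (TASK_DROPPED used as a placeholder here) when the
// framework has opted in via the PARTITION_AWARE capability.
const TaskState newState =
  isPartitionAware(framework) ? TASK_DROPPED : TASK_LOST;

const StatusUpdate& update = protobuf::createStatusUpdate(
    task->framework_id(),
    task->slave_id(),
    task->task_id(),
    newState,
    TaskStatus::SOURCE_MASTER,
    None(),
    "Slave " + slave->info.hostname() + " disconnected",
    TaskStatus::REASON_SLAVE_DISCONNECTED,
    (task->has_executor_id()
        ? Option<ExecutorID>(task->executor_id()) : None()));
{code}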



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6332) Don't send TASK_LOST in the agent

2016-10-07 Thread Neil Conway (JIRA)
Neil Conway created MESOS-6332:
--

 Summary: Don't send TASK_LOST in the agent
 Key: MESOS-6332
 URL: https://issues.apache.org/jira/browse/MESOS-6332
 Project: Mesos
  Issue Type: Improvement
  Components: slave
Reporter: Neil Conway
Assignee: Neil Conway


The agent sends {{TASK_LOST}} to handle various error situations. For 
partition-aware frameworks, we should not send {{TASK_LOST}} -- we should send 
a more specific {{TaskState}}, depending on the exact circumstances.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6331) Don't send TASK_LOST when accepting offers in a disconnected scheduler

2016-10-07 Thread Neil Conway (JIRA)
Neil Conway created MESOS-6331:
--

 Summary: Don't send TASK_LOST when accepting offers in a 
disconnected scheduler
 Key: MESOS-6331
 URL: https://issues.apache.org/jira/browse/MESOS-6331
 Project: Mesos
  Issue Type: Improvement
  Components: scheduler driver
Reporter: Neil Conway
Assignee: Neil Conway


Update this to send TASK_DROPPED for partition-aware frameworks:

{code}
if (!connected) {
  VLOG(1) << "Ignoring accept offers message as master is disconnected";

  // NOTE: Reply to the framework with TASK_LOST messages for each
  // task launch. See details from notes in launchTasks.
  foreach (const Offer::Operation& operation, operations) {
if (operation.type() != Offer::Operation::LAUNCH) {
  continue;
}

foreach (const TaskInfo& task, operation.launch().task_infos()) {
  StatusUpdate update = protobuf::createStatusUpdate(
  framework.id(),
  None(),
  task.task_id(),
  TASK_LOST,
  TaskStatus::SOURCE_MASTER,
  None(),
  "Master disconnected",
  TaskStatus::REASON_MASTER_DISCONNECTED);

  statusUpdate(UPID(), update, UPID());
}
  }
  return;
}
{code}
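
A rough sketch of that change, assuming the PARTITION_AWARE framework 
capability introduced for Mesos 1.1 (the capability scan could of course be 
computed once when the driver is created rather than per task):

{code}
// Sketch only: send TASK_DROPPED instead of TASK_LOST when the framework
// has registered with the PARTITION_AWARE capability.
bool partitionAware = false;
foreach (const FrameworkInfo::Capability& capability,
         framework.capabilities()) {
  if (capability.type() == FrameworkInfo::Capability::PARTITION_AWARE) {
    partitionAware = true;
  }
}

StatusUpdate update = protobuf::createStatusUpdate(
    framework.id(),
    None(),
    task.task_id(),
    partitionAware ? TASK_DROPPED : TASK_LOST,
    TaskStatus::SOURCE_MASTER,
    None(),
    "Master disconnected",
    TaskStatus::REASON_MASTER_DISCONNECTED);
{code}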




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6330) Send TASK_UNKNOWN for tasks on unknown agents

2016-10-07 Thread Neil Conway (JIRA)
Neil Conway created MESOS-6330:
--

 Summary: Send TASK_UNKNOWN for tasks on unknown agents
 Key: MESOS-6330
 URL: https://issues.apache.org/jira/browse/MESOS-6330
 Project: Mesos
  Issue Type: Improvement
  Components: master
Reporter: Neil Conway
Assignee: Neil Conway


In Mesos <= 1.0, we send {{TASK_LOST}} for explicit reconciliation requests for 
tasks running on agents the master has never heard about.

For partition-aware frameworks in Mesos >= 1.1, we should instead send 
TASK_UNKNOWN in this situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6329) Send TASK_DROPPED for task launch errors

2016-10-07 Thread Neil Conway (JIRA)
Neil Conway created MESOS-6329:
--

 Summary: Send TASK_DROPPED for task launch errors
 Key: MESOS-6329
 URL: https://issues.apache.org/jira/browse/MESOS-6329
 Project: Mesos
  Issue Type: Improvement
  Components: master
Reporter: Neil Conway
Assignee: Neil Conway


In Mesos <= 1.0, we send {{TASK_LOST}} for task launch attempts that fail due 
to a transient error (e.g., a concurrent dynamic reservation that consumes the 
resources the task launch was trying to use).

For PARTITION_AWARE frameworks, we should instead send TASK_DROPPED in this 
case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6328) Make initialization of openssl eager

2016-10-07 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-6328:
---

 Summary: Make initialization of openssl eager
 Key: MESOS-6328
 URL: https://issues.apache.org/jira/browse/MESOS-6328
 Project: Mesos
  Issue Type: Bug
  Components: security
Reporter: Benjamin Bannier
Priority: Minor


Currently, OpenSSL is initialized lazily: {{openssl::initialize}} is called 
when the first SSL socket is created with 
{{LibeventSSLSocketImpl::create}}. It should instead be possible to just call it 
in the spots where {{process::initialize}} is called.

This was brought up during https://reviews.apache.org/r/52154/.
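
A minimal sketch of the eager variant (assuming the {{process::network::openssl}} 
namespace and the {{USE_SSL_SOCKET}} build flag; the exact placement would be 
settled in the review):

{code}
// Inside process::initialize(), libprocess could do the SSL setup once and
// eagerly, instead of lazily on the first LibeventSSLSocketImpl::create():
#ifdef USE_SSL_SOCKET
  network::openssl::initialize();
#endif
{code}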



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6216) LibeventSSLSocketImpl::create is not safe to call concurrently with os::getenv

2016-10-07 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15554910#comment-15554910
 ] 

Till Toenshoff commented on MESOS-6216:
---

Today.

> LibeventSSLSocketImpl::create is not safe to call concurrently with os::getenv
> --
>
> Key: MESOS-6216
> URL: https://issues.apache.org/jira/browse/MESOS-6216
> Project: Mesos
>  Issue Type: Bug
>  Components: security
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>  Labels: mesosphere
> Attachments: build.log
>
>
> {{LibeventSSLSocketImpl::create}} is called whenever a potentially 
> ssl-enabled socket is created. It in turn calls {{openssl::initialize}} which 
> calls a function {{reinitialize}} using {{os::setenv}}. Here {{os::setenv}} 
> is used to set up SSL-related libprocess environment variables 
> {{LIBPROCESS_SSL_*}}.
> Since {{os::setenv}} is not thread-safe just like the {{::setenv}} it wraps, 
> any calling of functions like {{os::getenv}} (or via {{os::environment}}) 
> concurrently with the first invocation of {{LibeventSSLSocketImpl::create}} 
> performs unsynchronized r/w access to the same data structure in the runtime.
> We usually perform most setup of the environment before we start the 
> libprocess runtime with {{process::initialize}} from a {{main}} function, see 
> e.g., {{src/slave/main.cpp}} or {{src/master/main.cpp}} and others. It 
> appears that we should move the setup of libprocess' SSL environment 
> variables to a similar spot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6207) Python bindings fail to build with custom SVN installation path

2016-10-07 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15554765#comment-15554765
 ] 

Till Toenshoff commented on MESOS-6207:
---

[~ipronin] your reviewboard profile seems to be incomplete causing your patch 
to not have an author attribute set. I manually fixed that for this patch but 
you might want to fix that permanently in your ReviewBoard account. The missing 
email address seems to be the root cause.

> Python bindings fail to build with custom SVN installation path
> ---
>
> Key: MESOS-6207
> URL: https://issues.apache.org/jira/browse/MESOS-6207
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.0.1
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Trivial
> Fix For: 1.1.0
>
>
> In {{src/Makefile.am}} {{PYTHON_LDFLAGS}} variable is used while building 
> Python bindings. This variable picks {{LDFLAGS}} during configuration phase 
> before we check for custom SVN installation path and misses 
> {{-L$\{with_svn\}/lib}} flag. That causes a link error on systems with 
> uncommon SVN installation path.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6321) CHECK failure in HierarchicalAllocatorTest.NoDoubleAccounting

2016-10-07 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6321:
---
Shepherd: Michael Park
  Sprint: Mesosphere Sprint 44
Story Points: 1
Target Version/s: 1.1.0

> CHECK failure in HierarchicalAllocatorTest.NoDoubleAccounting
> -
>
> Key: MESOS-6321
> URL: https://issues.apache.org/jira/browse/MESOS-6321
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
>
> Observed in internal CI:
> {noformat}
> [15:52:21] : [Step 10/10] [ RUN  ] 
> HierarchicalAllocatorTest.NoDoubleAccounting
> [15:52:21]W: [Step 10/10] I1006 15:52:21.813817 23713 
> hierarchical.cpp:275] Added framework framework1
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814100 23713 
> hierarchical.cpp:1694] No allocations performed
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814102 23712 process.cpp:3377] 
> Handling HTTP event for process 'metrics' with path: '/metrics/snapshot'
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814121 23713 
> hierarchical.cpp:1789] No inverse offers to send out!
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814146 23713 
> hierarchical.cpp:1286] Performed allocation for 0 agents in 52445ns
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814206 23713 
> hierarchical.cpp:485] Added agent agent1 (agent1) with cpus(*):1 (allocated: 
> cpus(*):1)
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814237 23713 
> hierarchical.cpp:1694] No allocations performed
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814247 23713 
> hierarchical.cpp:1789] No inverse offers to send out!
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814259 23713 
> hierarchical.cpp:1309] Performed allocation for agent agent1 in 33887ns
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814294 23713 
> hierarchical.cpp:485] Added agent agent2 (agent2) with cpus(*):1 (allocated: 
> cpus(*):1)
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814332 23713 
> hierarchical.cpp:1694] No allocations performed
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814342 23713 
> hierarchical.cpp:1789] No inverse offers to send out!
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814349 23713 
> hierarchical.cpp:1309] Performed allocation for agent agent2 in 42682ns
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814417 23713 
> hierarchical.cpp:275] Added framework framework2
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814445 23713 
> hierarchical.cpp:1694] No allocations performed
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814455 23713 
> hierarchical.cpp:1789] No inverse offers to send out!
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814469 23713 
> hierarchical.cpp:1286] Performed allocation for 2 agents in 37976ns
> [15:52:21]W: [Step 10/10] F1006 15:52:21.824954 23692 json.hpp:334] Check 
> failed: 'boost::get(this)' Must be non NULL
> [15:52:21]W: [Step 10/10] *** Check failure stack trace: ***
> [15:52:21]W: [Step 10/10] @ 0x7fe953bbd71d  
> google::LogMessage::Fail()
> [15:52:21]W: [Step 10/10] @ 0x7fe953bbf55d  
> google::LogMessage::SendToLog()
> [15:52:21]W: [Step 10/10] @ 0x7fe953bbd30c  
> google::LogMessage::Flush()
> [15:52:21]W: [Step 10/10] @ 0x7fe953bbfe59  
> google::LogMessageFatal::~LogMessageFatal()
> [15:52:21]W: [Step 10/10] @   0x7cc903  JSON::Value::as<>()
> [15:52:21]W: [Step 10/10] @   0x8b633c  
> mesos::internal::tests::HierarchicalAllocatorTest_NoDoubleAccounting_Test::TestBody()
> [15:52:21]W: [Step 10/10] @  0x129ce23  
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> [15:52:21]W: [Step 10/10] @  0x1292f07  testing::Test::Run()
> [15:52:21]W: [Step 10/10] @  0x1292fae  
> testing::TestInfo::Run()
> [15:52:21]W: [Step 10/10] @  0x12930b5  
> testing::TestCase::Run()
> [15:52:21]W: [Step 10/10] @  0x1293368  
> testing::internal::UnitTestImpl::RunAllTests()
> [15:52:21]W: [Step 10/10] @  0x1293624  
> testing::UnitTest::Run()
> [15:52:21]W: [Step 10/10] @   0x507254  main
> [15:52:21]W: [Step 10/10] @ 0x7fe95122876d  (unknown)
> [15:52:21]W: [Step 10/10] @   0x51e341  (unknown)
> [15:52:21]W: [Step 10/10] Aborted (core dumped)
> [15:52:21]W: [Step 10/10] Process exited with code 134
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6321) CHECK failure in HierarchicalAllocatorTest.NoDoubleAccounting

2016-10-07 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15554654#comment-15554654
 ] 

Alexander Rukletsov commented on MESOS-6321:


A good run should look like this:
{noformat}
[ RUN  ] HierarchicalAllocatorTest.NoDoubleAccounting
I1007 11:29:37.357229 3211264 hierarchical.cpp:149] Initialized hierarchical 
allocator process
I1007 11:29:37.357724 1601536 hierarchical.cpp:275] Added framework framework1
I1007 11:29:37.357810 1601536 hierarchical.cpp:1694] No allocations performed
I1007 11:29:37.357842 1601536 hierarchical.cpp:1789] No inverse offers to send 
out!
I1007 11:29:37.357875 1601536 hierarchical.cpp:1286] Performed allocation for 0 
agents in 127us
I1007 11:29:37.358070 1601536 hierarchical.cpp:485] Added agent agent1 (agent1) 
with cpus(*):1 (allocated: cpus(*):1)
I1007 11:29:37.358151 1601536 hierarchical.cpp:1694] No allocations performed
I1007 11:29:37.358165 1601536 hierarchical.cpp:1789] No inverse offers to send 
out!
I1007 11:29:37.358182 1601536 hierarchical.cpp:1309] Performed allocation for 
agent agent1 in 87us
I1007 11:29:37.358243 1601536 hierarchical.cpp:485] Added agent agent2 (agent2) 
with cpus(*):1 (allocated: cpus(*):1)
I1007 11:29:37.358337 1601536 hierarchical.cpp:1694] No allocations performed
I1007 11:29:37.358361 1601536 hierarchical.cpp:1789] No inverse offers to send 
out!
I1007 11:29:37.358373 1601536 hierarchical.cpp:1309] Performed allocation for 
agent agent2 in 102us
I1007 11:29:37.358554 1601536 hierarchical.cpp:275] Added framework framework2
I1007 11:29:37.358619 1601536 hierarchical.cpp:1694] No allocations performed
I1007 11:29:37.358649 1601536 hierarchical.cpp:1789] No inverse offers to send 
out!
I1007 11:29:37.358662 1601536 hierarchical.cpp:1286] Performed allocation for 2 
agents in 95us
I1007 11:29:37.358786 1064960 process.cpp:3377] Handling HTTP event for process 
'metrics' with path: '/metrics/snapshot'
[   OK ] HierarchicalAllocatorTest.NoDoubleAccounting (18 ms)
{noformat}

The test failed because the allocation events are processed after the metrics 
event, meaning the metrics do not yet contain the information we are looking for. 
The fix would be to make sure the allocation events are processed *before* the 
metrics are queried.
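
A sketch of that fix using the usual libprocess {{Clock}} pattern from the 
allocator tests ({{Metrics()}} below stands in for whatever helper the test 
uses to hit {{/metrics/snapshot}}):

{code}
// Sketch only: with the clock paused, Clock::settle() blocks until all
// queued events (including the allocator's dispatches) have been processed,
// so the metrics query afterwards observes the finished allocations.
Clock::pause();

allocator->addFramework(...);  // enqueues allocation events
allocator->addSlave(...);

Clock::settle();               // drain the allocator's event queue first

JSON::Object metrics = Metrics();  // only now query /metrics/snapshot
{code}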

> CHECK failure in HierarchicalAllocatorTest.NoDoubleAccounting
> -
>
> Key: MESOS-6321
> URL: https://issues.apache.org/jira/browse/MESOS-6321
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
>
> Observed in internal CI:
> {noformat}
> [15:52:21] : [Step 10/10] [ RUN  ] 
> HierarchicalAllocatorTest.NoDoubleAccounting
> [15:52:21]W: [Step 10/10] I1006 15:52:21.813817 23713 
> hierarchical.cpp:275] Added framework framework1
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814100 23713 
> hierarchical.cpp:1694] No allocations performed
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814102 23712 process.cpp:3377] 
> Handling HTTP event for process 'metrics' with path: '/metrics/snapshot'
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814121 23713 
> hierarchical.cpp:1789] No inverse offers to send out!
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814146 23713 
> hierarchical.cpp:1286] Performed allocation for 0 agents in 52445ns
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814206 23713 
> hierarchical.cpp:485] Added agent agent1 (agent1) with cpus(*):1 (allocated: 
> cpus(*):1)
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814237 23713 
> hierarchical.cpp:1694] No allocations performed
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814247 23713 
> hierarchical.cpp:1789] No inverse offers to send out!
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814259 23713 
> hierarchical.cpp:1309] Performed allocation for agent agent1 in 33887ns
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814294 23713 
> hierarchical.cpp:485] Added agent agent2 (agent2) with cpus(*):1 (allocated: 
> cpus(*):1)
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814332 23713 
> hierarchical.cpp:1694] No allocations performed
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814342 23713 
> hierarchical.cpp:1789] No inverse offers to send out!
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814349 23713 
> hierarchical.cpp:1309] Performed allocation for agent agent2 in 42682ns
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814417 23713 
> hierarchical.cpp:275] Added framework framework2
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814445 23713 
> hierarchical.cpp:1694] No allocations performed
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814455 23713 
> hierarchical.cpp:1789] No inverse offers to send out!
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814469 23713 
> hierarchical.cpp:1286] Performed allocation for 2 agents in 37976ns
> 

[jira] [Commented] (MESOS-2723) The mesos-execute tool does not support zk:// master URLs

2016-10-07 Thread Christian Parpart (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15554589#comment-15554589
 ] 

Christian Parpart commented on MESOS-2723:
--

Hey,

I was just expecting the {{--master}} flag to support zk:// URLs too, which is how I ended up in this ticket. Can we bump the review again somehow?

Best regards,
Christian.

> The mesos-execute tool does not support zk:// master URLs
> -
>
> Key: MESOS-2723
> URL: https://issues.apache.org/jira/browse/MESOS-2723
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.22.1
>Reporter: Tom Arnfeld
>Assignee: Tom Arnfeld
>
> It appears that the {{mesos-execute}} command line tool does its own PID
> validation of the {{--master}} parameter, which prevents it from supporting
> clusters managed with ZooKeeper.
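A hedged illustration of the alternative, not the actual mesos-execute code or the pending patch: instead of validating {{--master}} as a libprocess PID up front, the raw flag value could be handed to the scheduler driver, whose master detector already resolves both "host:port" and "zk://host1:port1,host2:port2/path" forms.

{code}
#include <string>

#include <mesos/scheduler.hpp>

int runFramework(mesos::Scheduler* scheduler,
                 const mesos::FrameworkInfo& framework,
                 const std::string& master)  // e.g. "zk://10.0.0.1:2181/mesos"
{
  // No UPID parsing or validation here; the driver resolves the master.
  mesos::MesosSchedulerDriver driver(scheduler, framework, master);
  return driver.run() == mesos::DRIVER_STOPPED ? 0 : 1;
}
{code}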



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6327) Large docker images make the mesos containerizer crash with: Too many levels of symbolic links

2016-10-07 Thread Rogier Dikkes (JIRA)
Rogier Dikkes created MESOS-6327:


 Summary: Large docker images make the mesos containerizer crash 
with: Too many levels of symbolic links
 Key: MESOS-6327
 URL: https://issues.apache.org/jira/browse/MESOS-6327
 Project: Mesos
  Issue Type: Bug
  Components: containerization, docker
Affects Versions: 1.0.1, 1.0.0
 Environment: centos 7.2 (1511), ubuntu 14.04 (trusty). Replicated in 
the Apache Aurora vagrant image
Reporter: Rogier Dikkes
Priority: Critical


When deploying Mesos containers from large (6 GB+, 60+ layers) Docker images, the task crashes with a "Too many levels of symbolic links" error.

Mesos agent logs:
E1007 08:40:12.954227  8117 slave.cpp:3976] Container 'a1d759ae-5bc6-4c4e-ac03-717fbb8e5da4' for executor 'thermos-www-data-devel-hello_docker_image-0-d42d2af6-6b44-4b2b-be95-e1ba93a6b365' of framework dfc91a86-84b9-4539-a7be-4ace7b7b44a1- failed to start: Collect failed: Collect failed: Failed to copy layer: cp: cannot stat ‘/var/lib/mesos/provisioner/containers/a1d759ae-5bc6-4c4e-ac03-717fbb8e5da4/backends/copy/rootfses/5f328f72-25d4-4a26-ac83-8d30bbc44e97/usr/share/zoneinfo/right/Asia/Urumqi’: Too many levels of symbolic links
... (complete pastebin: http://pastebin.com/umZ4Q5d1 )

How to replicate:
Start the Aurora vagrant image. Adjust /etc/mesos-slave/executor_registration_timeout to 5 mins. Adjust the file /vagrant/examples/jobs/hello_docker_image.aurora to start a large Docker image instead of the example one (you can use anldisr/jupyter:0.4, which I created as a test image based on the Jupyter notebook stacks). Create the job and watch it fail after a number of minutes.

The Mesos sandbox is empty.

Aurora errors I see:
28 minutes ago - FAILED : Failed to launch container: Collect failed: Collect failed: Failed to copy layer: cp: cannot stat ‘/var/lib/mesos/provisioner/containers/93420a36-0e0c-4f04-b401-74c426c25686/backends/copy/rootfses/6e185a51-7174-4b0d-a305-42b634eb91bb/usr/share/zoneinfo/right/Asia/Urumqi’: Too many levels of symbolic links cp: cannot stat ... Too many levels of symbolic links ; Container destroyed while provisioning images
(complete pastebin: http://pastebin.com/uecHYD5J )

To rule out the image itself, I started this and other images as plain Docker containers; they run without issues.

Related Mesos flags configured:
-appc_store_dir /tmp/mesos/images/appc
-containerizers docker,mesos
-executor_registration_timeout 5mins
-image_providers appc,docker
-image_provisioner_backend copy
-isolation filesystem/linux,docker/runtime

Affected Mesos versions tested: 1.0.1 & 1.0.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6326) Build failed on Mac

2016-10-07 Thread Klaus Ma (JIRA)
Klaus Ma created MESOS-6326:
---

 Summary: Build failed on Mac
 Key: MESOS-6326
 URL: https://issues.apache.org/jira/browse/MESOS-6326
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.0.1
Reporter: Klaus Ma
Priority: Minor


Building Mesos 1.0.1 on Mac failed:

{{uname -a}}: Darwin Klauss-MacBook-Pro.local 16.0.0 Darwin Kernel Version 16.0.0: Mon Aug 29 17:56:20 PDT 2016; root:xnu-3789.1.32~3/RELEASE_X86_64 x86_64

{code}
In file included from ../../src/appc/spec.cpp:19:
In file included from ../../3rdparty/stout/include/stout/protobuf.hpp:31:
In file included from ../3rdparty/protobuf-2.6.1/src/google/protobuf/repeated_field.h:58:
In file included from ../3rdparty/protobuf-2.6.1/src/google/protobuf/generated_message_util.h:44:
In file included from ../3rdparty/protobuf-2.6.1/src/google/protobuf/stubs/once.h:81:
In file included from ../3rdparty/protobuf-2.6.1/src/google/protobuf/stubs/atomicops.h:184:
../3rdparty/protobuf-2.6.1/src/google/protobuf/stubs/atomicops_internals_macosx.h:164:10: error: 'OSAtomicAdd64Barrier' is deprecated: first deprecated in macOS 10.12 - Use std::atomic_fetch_add() from <atomic> instead [-Werror,-Wdeprecated-declarations]
  return OSAtomicAdd64Barrier(increment,
         ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/include/libkern/OSAtomicDeprecated.h:247:9: note: 'OSAtomicAdd64Barrier' has been explicitly marked deprecated here
int64_t OSAtomicAdd64Barrier( int64_t __theAmount,
        ^
In file included from ../../src/appc/spec.cpp:19:
In file included from ../../3rdparty/stout/include/stout/protobuf.hpp:31:
In file included from ../3rdparty/protobuf-2.6.1/src/google/protobuf/repeated_field.h:58:
In file included from ../3rdparty/protobuf-2.6.1/src/google/protobuf/generated_message_util.h:44:
In file included from ../3rdparty/protobuf-2.6.1/src/google/protobuf/stubs/once.h:81:
In file included from ../3rdparty/protobuf-2.6.1/src/google/protobuf/stubs/atomicops.h:184:
../3rdparty/protobuf-2.6.1/src/google/protobuf/stubs/atomicops_internals_macosx.h:173:9: error: 'OSAtomicCompareAndSwap64Barrier' is deprecated: first deprecated in macOS 10.12 - Use std::atomic_compare_exchange_strong() from <atomic> instead [-Werror,-Wdeprecated-declarations]
    if (OSAtomicCompareAndSwap64Barrier(
        ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/include/libkern/OSAtomicDeprecated.h:645:9: note: 'OSAtomicCompareAndSwap64Barrier' has been explicitly marked deprecated here
bool    OSAtomicCompareAndSwap64Barrier( int64_t __oldValue, int64_t __newValue,
        ^
12 errors generated.
make[2]: *** [appc/libmesos_no_3rdparty_la-spec.lo] Error 1
make[1]: *** [all] Error 2
make: *** [all-recursive] Error 1
{code}
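For reference, a minimal illustrative sketch of the {{<atomic>}} replacements the compiler notes point to. This is not a patch for the bundled protobuf 2.6.1 sources, just the modern equivalents of the two deprecated calls:

{code}
#include <atomic>
#include <cstdint>

std::atomic<int64_t> value{0};

int64_t Add64Barrier(int64_t increment)
{
  // std::atomic replacement for OSAtomicAdd64Barrier(): fetch_add returns
  // the previous value, so add the increment to match the "returns the new
  // value" semantics of the OSAtomic call.
  return value.fetch_add(increment) + increment;
}

bool CompareAndSwap64Barrier(int64_t oldValue, int64_t newValue)
{
  // std::atomic replacement for OSAtomicCompareAndSwap64Barrier().
  return value.compare_exchange_strong(oldValue, newValue);
}
{code}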



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6325) Boolean member Executor::commandExecutor not always properly initialized

2016-10-07 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-6325:
---

 Summary: Boolean member Executor::commandExecutor not always 
properly initialized
 Key: MESOS-6325
 URL: https://issues.apache.org/jira/browse/MESOS-6325
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Benjamin Bannier


The constructor of {{Executor}} in {{src/slave/slave}} does not make sure that the member variable {{commandExecutor}} is always set. The following logic is used to determine its value:

{code}
Result<string> executorPath =
  os::realpath(path::join(slave->flags.launcher_dir, MESOS_EXECUTOR));

if (executorPath.isSome()) {
  commandExecutor =
    strings::contains(info.command().value(), executorPath.get());
}
{code}

Should we fail to determine the realpath of the Mesos executor, {{commandExecutor}} will not be set. Since {{commandExecutor}} is a scalar field, no default initialization happens, and its value will be whatever happens to be in memory (which may often evaluate to {{true}}).

We need to make sure the member variable is set on all branches. Looking at the code, it seems we could simply assert that {{executorPath}} is some, as sketched below.
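A minimal sketch of what the fix could look like, illustrative only: it reuses the fragment above, and {{CHECK_SOME}} is the existing stout assertion macro.

{code}
// Option A: give the field a deterministic default so it is set on every
// branch even if the realpath lookup fails.
bool commandExecutor = false;

Result<string> executorPath =
  os::realpath(path::join(slave->flags.launcher_dir, MESOS_EXECUTOR));

if (executorPath.isSome()) {
  commandExecutor =
    strings::contains(info.command().value(), executorPath.get());
}

// Option B (what the description hints at): assert that the realpath
// lookup succeeded, so the branch above cannot be skipped silently.
// CHECK_SOME(executorPath);
{code}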

This was pointed out by Coverity:
https://scan5.coverity.com/reports.htm#v10074/p10429/fileInstanceId=100298128=28784922=1373526.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)