[jira] [Commented] (MESOS-10034) Agent/executor domain socket communication

2020-01-13 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014308#comment-17014308
 ] 

Benno Evers commented on MESOS-10034:
-

During implementation, we reduced the scope of this epic to exclude custom 
executors running in containers with their own rootfs (obsoleting the 
MESOS-10060 and MESOS-10037 tickets).

I've attached the patches implementing that functionality to the respective 
tickets. The rest of the code has been committed to master and will land in 
Mesos 1.10.

> Agent/executor domain socket communication
> --
>
> Key: MESOS-10034
> URL: https://issues.apache.org/jira/browse/MESOS-10034
> Project: Mesos
>  Issue Type: Epic
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
> Fix For: 1.10
>
>
> Enable executors to communicate with Mesos agents via unix domain sockets.
> Design: 
> https://docs.google.com/document/d/1FzlIlK8542pgLSLwXBXp7Hf_nI5qQkkMsN3vkJD48L4/edit



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10034) Agent/executor domain socket communication

2020-01-13 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-10034:
---

   Fix Version/s: 1.10
Target Version/s: 1.10
Assignee: Benno Evers
  Resolution: Fixed

> Agent/executor domain socket communication
> --
>
> Key: MESOS-10034
> URL: https://issues.apache.org/jira/browse/MESOS-10034
> Project: Mesos
>  Issue Type: Epic
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
> Fix For: 1.10
>
>
> Enable executors to communicate with Mesos agents via unix domain sockets.
> Design: 
> https://docs.google.com/document/d/1FzlIlK8542pgLSLwXBXp7Hf_nI5qQkkMsN3vkJD48L4/edit





[jira] [Assigned] (MESOS-10086) Add support for systemd socket activation for mesos domain sockets

2020-01-13 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-10086:
---

Fix Version/s: 1.10
 Assignee: Benno Evers
   Resolution: Fixed

{noformat}
commit 8a9eb868ff3979df1cb68ebf56cb451b411e06a9 (HEAD -> master, origin/master)
Author: Benno Evers 
Date:   Thu Jan 9 15:48:48 2020 +0100

Added systemd support to domain socket agent flag.

Added the ability to specify a unix domain socket
as `systemd:` for the `--domain_socket_location`
agent flag.

This will instruct the agent to expect the domain socket
to be passed by systemd under the specified name.

Review: https://reviews.apache.org/r/71977

commit b73965de9874559c02bed42cd597e6ec678a461d
Author: Benno Evers 
Date:   Thu Jan 9 15:43:29 2020 +0100

Added support for systemd socket activation API.

Added support for the systemd socket activation API,
which allows systemd to pass listening file descriptors
to a given service.

Review: https://reviews.apache.org/r/71976
{noformat}
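
For readers unfamiliar with socket activation: systemd hands already-open listening descriptors to the service and announces them through environment variables. The following is a minimal Python sketch of the receiving side, assuming the conventions documented for sd_listen_fds(3) (`LISTEN_PID`, `LISTEN_FDS`, `LISTEN_FDNAMES`, descriptors starting at 3). The function name `systemd_listen_fds` and the positional fallback key are made up for illustration; this is not the code from the commits above.

```python
import os

SD_LISTEN_FDS_START = 3  # first descriptor number passed by systemd


def systemd_listen_fds(environ=None, pid=None):
    """Return {name: fd} for sockets passed via systemd socket activation.

    Returns an empty dict when the variables are missing, malformed, or
    addressed to a different process.
    """
    environ = os.environ if environ is None else environ
    pid = os.getpid() if pid is None else pid

    try:
        # LISTEN_PID guards against variables inherited from a parent.
        if int(environ.get("LISTEN_PID", "")) != pid:
            return {}
        count = int(environ.get("LISTEN_FDS", ""))
    except ValueError:
        return {}

    raw_names = environ.get("LISTEN_FDNAMES", "")
    names = raw_names.split(":") if raw_names else []

    fds = {}
    for i in range(count):
        # Hypothetical fallback key for descriptors without a name.
        name = names[i] if i < len(names) else f"fd{i}"
        fds[name] = SD_LISTEN_FDS_START + i
    return fds
```

Presumably the name given after `systemd:` in `--domain_socket_location` would then be looked up among the returned names.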

> Add support for systemd socket activation for mesos domain sockets
> --
>
> Key: MESOS-10086
> URL: https://issues.apache.org/jira/browse/MESOS-10086
> Project: Mesos
>  Issue Type: Task
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
> Fix For: 1.10
>
>
> We should implement support for systemd socket activation for the domain 
> socket used by agents and executors, so that it does not need to be removed 
> and re-created on agent startup.





[jira] [Created] (MESOS-10086) Add support for systemd socket activation for mesos domain sockets

2020-01-13 Thread Benno Evers (Jira)
Benno Evers created MESOS-10086:
---

 Summary: Add support for systemd socket activation for mesos 
domain sockets
 Key: MESOS-10086
 URL: https://issues.apache.org/jira/browse/MESOS-10086
 Project: Mesos
  Issue Type: Task
Reporter: Benno Evers


We should implement support for systemd socket activation for the domain socket 
used by agents and executors, so that it does not need to be removed and 
re-created on agent startup.





[jira] [Commented] (MESOS-10062) Implement relative path computation for stout

2020-01-07 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009722#comment-17009722
 ] 

Benno Evers commented on MESOS-10062:
-

{noformat}
commit 0e8b96ce90980e47adff9ecc5d7b1d31a688727b
Author: Benno Evers 
Date:   Tue Jan 7 13:31:16 2020 +0100

Added deprecated absolute() function for backwards compatibility.

Review: https://reviews.apache.org/r/71961/

commit 9d9d97c607359f5d2df95f6ef955054d80d130d1
Author: Benno Evers 
Date:   Tue Jan 7 13:30:58 2020 +0100

Added support for deprecated attribute to stout.

Added support for the [[deprecated]] attribute to stout.

Review: https://reviews.apache.org/r/71960/

commit 12e4ee81599f1b696f501901d246002310dcb9c1
Author: Benjamin Bannier 
Date:   Tue Jan 7 13:19:31 2020 +0100

Added a stout function to compute relative paths.

Review: https://reviews.apache.org/r/71882/

commit b32dd46549e8bd2761e18e0fae69f38858279ea9
Author: Benjamin Bannier 
Date:   Tue Jan 7 13:19:26 2020 +0100

Allowed specifying path separator in a `path::join` overload.

Review: https://reviews.apache.org/r/71881/

commit e31a25131ee345792ceda37354bc824bf038bc1c
Author: Benjamin Bannier 
Date:   Tue Jan 7 13:19:15 2020 +0100

Renamed stout's path-related absolute functions to is_absolute.

Review: https://reviews.apache.org/r/71880/

commit 85c5ca68922540ea4f8e74375ca0a636cf1cf67b
Author: Benjamin Bannier 
Date:   Tue Jan 7 13:19:10 2020 +0100

Renamed stout's path-related absolute functions to is_absolute.

Review: https://reviews.apache.org/r/71879/

commit ddf59b8b99b95fe3dc0e862f36dfeed2c0fd287a
Author: Benjamin Bannier 
Date:   Tue Jan 7 13:19:04 2020 +0100

Added iteration support to stout's Path.

Review: https://reviews.apache.org/r/71878/
{noformat}

> Implement relative path computation for stout
> -
>
> Key: MESOS-10062
> URL: https://issues.apache.org/jira/browse/MESOS-10062
> Project: Mesos
>  Issue Type: Task
>Reporter: Benno Evers
>Assignee: Benjamin Bannier
>Priority: Major
> Fix For: 1.10
>
>
> When using executor domain sockets, we might need to specify relative paths 
> in order to stay below the path length limit of 108 characters.
> To do so, we should implement a `path::relative_path()` function in stout 
> that can compute the relative path between two directories.





[jira] [Assigned] (MESOS-10074) Adapt design for executor domain sockets for agent restarts

2019-12-18 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-10074:
---

  Sprint: Studio 4: RI-21 61
Assignee: Benno Evers

> Adapt design for executor domain sockets for agent restarts
> ---
>
> Key: MESOS-10074
> URL: https://issues.apache.org/jira/browse/MESOS-10074
> Project: Mesos
>  Issue Type: Task
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>
> During testing, it was found that the proposed design for executor domain 
> sockets does not correctly handle agent restarts; in particular on a domain 
> socket-enabled agent all tasks running in containers with a separate root 
> filesystem would not have survived an agent reboot.
> We should change the design to fix that (and implement the change in a 
> follow-up ticket).





[jira] [Created] (MESOS-10074) Adapt design for executor domain sockets for agent restarts

2019-12-18 Thread Benno Evers (Jira)
Benno Evers created MESOS-10074:
---

 Summary: Adapt design for executor domain sockets for agent 
restarts
 Key: MESOS-10074
 URL: https://issues.apache.org/jira/browse/MESOS-10074
 Project: Mesos
  Issue Type: Task
Reporter: Benno Evers


During testing, it was found that the proposed design for executor domain 
sockets does not correctly handle agent restarts; in particular on a domain 
socket-enabled agent all tasks running in containers with a separate root 
filesystem would not have survived an agent reboot.

We should change the design to fix that (and implement the change in a 
follow-up ticket).





[jira] [Commented] (MESOS-9968) WWWAuthenticate header parsing fails when commas are in (quoted) realm

2019-12-05 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988921#comment-16988921
 ] 

Benno Evers commented on MESOS-9968:


1.8.x:
{noformat}
commit 21ec06ed44c1fbd2272081b20bdcee630759f52d
Author: Benjamin Bannier 
Date:   Mon Sep 23 10:24:50 2019 +0200

Fixed parsing of HTTP authentication headers.

This patch adds support for quoted strings in `WWW-Authenticate` headers
and allows spaces when delimiting authentication attributes of
the form `WWW-Authenticate: a=b, c=d`, both of which are allowed by
RFC 2617.

Review: https://reviews.apache.org/r/71534

commit 32d6937bee6c2c43d769daa7b95b33856b9b8364
Author: Benjamin Bannier 
Date:   Mon Sep 23 10:23:27 2019 +0200

Cleaned up `HTTPTest.WWWAuthenticateHeader`.

This patch removes a number of error-prone temporaries previously reused
in the test.

Review: https://reviews.apache.org/r/71533
{noformat}

1.9.x
{noformat}
commit 5fa73f683c38c025b0e650de24474e0fdf95d1f4
Author: Benjamin Bannier 
Date:   Mon Sep 23 10:24:50 2019 +0200

Fixed parsing of HTTP authentication headers.

This patch adds support for quoted strings in `WWW-Authenticate` headers
and allows spaces when delimiting authentication attributes of
the form `WWW-Authenticate: a=b, c=d`, both of which are allowed by
RFC 2617.

Review: https://reviews.apache.org/r/71534

commit 5f6d218a3123ec35b3a14ce20e72b5ca3594cef2
Author: Benjamin Bannier 
Date:   Mon Sep 23 10:23:27 2019 +0200

Cleaned up `HTTPTest.WWWAuthenticateHeader`.

This patch removes a number of error-prone temporaries previously reused
in the test.

Review: https://reviews.apache.org/r/71533
{noformat}

> WWWAuthenticate header parsing fails when commas are in (quoted) realm
> --
>
> Key: MESOS-9968
> URL: https://issues.apache.org/jira/browse/MESOS-9968
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API, libprocess
>Reporter: Jan Schlicht
>Assignee: Benjamin Bannier
>Priority: Major
> Fix For: 1.10.0
>
>
> This was discovered when trying to launch the 
> {{[nvcr.io/nvidia/tensorflow:19.08-py3|http://nvcr.io/nvidia/tensorflow:19.08-py3]}}
>  image using the Mesos containerizer. This launch fails with
> {noformat}
> Failed to launch container: Failed to get WWW-Authenticate header: Unexpected 
> auth-param format: 
> 'realm="https://nvcr.io/proxy_auth?scope=repository:nvidia/tensorflow:pull' 
> in 
> 'realm="https://nvcr.io/proxy_auth?scope=repository:nvidia/tensorflow:pull,push;'
> {noformat}
> This is because the [header tokenization in 
> libprocess|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/http.cpp#L640]
>  can't handle commas in quoted realm values.
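
The failure described above is what happens when a header value is split on every comma. As an illustration only, here is a minimal Python sketch of a quote-aware split. The function name `split_auth_params` and the example header are made up, and this is not the libprocess fix itself.

```python
def split_auth_params(value):
    """Parse 'a=b, c="d,e"' into a dict, splitting only on commas outside quotes."""
    parts, current, in_quotes, escaped = [], [], False, False
    for ch in value:
        if escaped:
            escaped = False  # the character after a backslash is taken literally
        elif in_quotes and ch == "\\":
            escaped = True
        elif ch == '"':
            in_quotes = not in_quotes
        elif ch == "," and not in_quotes:
            # A comma outside quotes ends the current auth-param.
            parts.append("".join(current))
            current = []
            continue
        current.append(ch)
    parts.append("".join(current))

    params = {}
    for part in parts:
        part = part.strip()
        if not part:
            continue
        key, sep, val = part.partition("=")
        if not sep:
            raise ValueError(f"unexpected auth-param format: {part!r}")
        val = val.strip()
        if len(val) >= 2 and val[0] == val[-1] == '"':
            val = val[1:-1]  # drop the surrounding quotes, keep inner commas
        params[key.strip()] = val
    return params
```

With a made-up header such as `realm="https://example.org/auth?scope=repo:pull,push", service="registry"`, the comma inside the quoted realm stays part of the realm value instead of starting a new parameter.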





[jira] [Commented] (MESOS-10059) Let the command executor connect through a domain socket when available

2019-12-03 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987166#comment-16987166
 ] 

Benno Evers commented on MESOS-10059:
-

As it turns out, this will be resolved by MESOS-10039.

> Let the command executor connect through a domain socket when available
> ---
>
> Key: MESOS-10059
> URL: https://issues.apache.org/jira/browse/MESOS-10059
> Project: Mesos
>  Issue Type: Task
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>
> If the command executor is using the v1 API (--http_command_executors agent 
> flag) and the MESOS_DOMAIN_SOCKET environment variable is set, the command 
> executor should use the domain socket to communicate with the agent or die 
> trying.





[jira] [Commented] (MESOS-10061) Implement chmod() support for stout

2019-12-03 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987163#comment-16987163
 ] 

Benno Evers commented on MESOS-10061:
-

https://reviews.apache.org/r/71866/

> Implement chmod() support for stout
> ---
>
> Key: MESOS-10061
> URL: https://issues.apache.org/jira/browse/MESOS-10061
> Project: Mesos
>  Issue Type: Task
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>
> When using executor domain sockets, we need to be able to change permissions 
> on the domain socket to 0600. To do that, we should implement a new function 
> `os::chmod()` in stout.
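
To illustrate what restricting the socket to 0600 means in practice, here is a minimal Python sketch that binds a unix domain socket and then limits its permissions to the owner. The function name `create_private_socket` is made up, and the sketch uses Python's `os.chmod` rather than the proposed stout function.

```python
import os
import socket
import stat


def create_private_socket(path):
    """Bind a listening unix socket at `path`, accessible only to its owner."""
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        server.bind(path)
        # Equivalent of chmod(path, 0600): read/write for the owner only.
        os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
        server.listen()
    except OSError:
        server.close()
        raise
    return server
```

Note that there is a short window between `bind` and `chmod` in which the socket file still has the permissions derived from the process umask.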





[jira] [Commented] (MESOS-10061) Implement chmod() support for stout

2019-12-03 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987089#comment-16987089
 ] 

Benno Evers commented on MESOS-10061:
-

Reopening since `os::chown()` was just a typo for `os::chmod()`.

> Implement chmod() support for stout
> ---
>
> Key: MESOS-10061
> URL: https://issues.apache.org/jira/browse/MESOS-10061
> Project: Mesos
>  Issue Type: Task
>Reporter: Benno Evers
>Assignee: Benjamin Bannier
>Priority: Major
>
> When using executor domain sockets, we need to be able to change permissions 
> on the domain socket to 0600. To do that, we should implement a new function 
> `os::chmod()` in stout.





[jira] [Assigned] (MESOS-10060) Create code to bind-mount domain sockets into docker-type executor containers

2019-12-02 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-10060:
---

Assignee: Benno Evers

> Create code to bind-mount domain sockets into docker-type executor containers
> -
>
> Key: MESOS-10060
> URL: https://issues.apache.org/jira/browse/MESOS-10060
> Project: Mesos
>  Issue Type: Task
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>
> Same as MESOS-10037, but for containers of type `DOCKER`.





[jira] [Created] (MESOS-10062) Implement relative path computation for stout

2019-12-02 Thread Benno Evers (Jira)
Benno Evers created MESOS-10062:
---

 Summary: Implement relative path computation for stout
 Key: MESOS-10062
 URL: https://issues.apache.org/jira/browse/MESOS-10062
 Project: Mesos
  Issue Type: Task
Reporter: Benno Evers


When using executor domain sockets, we might need to specify relative paths in 
order to stay below the path length limit of 108 characters.

To do so, we should implement a `path::relative_path()` function in stout that 
can compute the relative path between two directories.
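
As an illustration of why a relative path helps, here is a minimal Python sketch that falls back to a relative path when the absolute socket path would exceed the limit. It uses Python's `os.path.relpath` in place of the proposed stout function; the name `socket_path_for_connect` and the `start_dir` parameter are made up for this example.

```python
import os
import os.path

SUN_PATH_MAX = 108  # path length limit quoted in the ticket


def socket_path_for_connect(socket_path, start_dir):
    """Return a path to `socket_path` short enough for a unix socket address.

    Prefers the absolute path and falls back to the path relative to
    `start_dir`, the directory the process would connect from.
    """
    absolute = os.path.abspath(socket_path)
    # Stay strictly below the limit, leaving room for a terminating NUL byte.
    if len(os.fsencode(absolute)) < SUN_PATH_MAX:
        return absolute

    relative = os.path.relpath(absolute, start=start_dir)
    if len(os.fsencode(relative)) < SUN_PATH_MAX:
        return relative

    raise ValueError("socket path is too long even when made relative")
```

The relative form only works if the connecting process actually has `start_dir` as its working directory at the time it connects.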





[jira] [Created] (MESOS-10061) Implement chown() support for stout

2019-12-02 Thread Benno Evers (Jira)
Benno Evers created MESOS-10061:
---

 Summary: Implement chown() support for stout
 Key: MESOS-10061
 URL: https://issues.apache.org/jira/browse/MESOS-10061
 Project: Mesos
  Issue Type: Task
Reporter: Benno Evers


When using executor domain sockets, we need to be able to change permissions on 
the domain socket to 0600. To do that, we should implement a new function 
`os::chown()` in stout.





[jira] [Created] (MESOS-10060) Create code to bind-mount domain sockets into docker-type executor containers

2019-12-02 Thread Benno Evers (Jira)
Benno Evers created MESOS-10060:
---

 Summary: Create code to bind-mount domain sockets into docker-type 
executor containers
 Key: MESOS-10060
 URL: https://issues.apache.org/jira/browse/MESOS-10060
 Project: Mesos
  Issue Type: Task
Reporter: Benno Evers


Same as MESOS-10037, but for containers of type `DOCKER`.





[jira] [Commented] (MESOS-10037) Create code to bind-mount domain sockets into mesos-type executor containers

2019-12-02 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16985993#comment-16985993
 ] 

Benno Evers commented on MESOS-10037:
-

https://reviews.apache.org/r/71836/

> Create code to bind-mount domain sockets into mesos-type executor containers
> 
>
> Key: MESOS-10037
> URL: https://issues.apache.org/jira/browse/MESOS-10037
> Project: Mesos
>  Issue Type: Task
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>
> On an agent with domain socket communication enabled, when a new executor is 
> launched and the executor is running in a container of type MESOS, the 
> agent should bind-mount the domain socket into the executor's root directory.





[jira] [Assigned] (MESOS-10059) Let the command executor connect through a domain socket when available

2019-12-02 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-10059:
---

Assignee: (was: Benno Evers)

> Let the command executor connect through a domain socket when available
> ---
>
> Key: MESOS-10059
> URL: https://issues.apache.org/jira/browse/MESOS-10059
> Project: Mesos
>  Issue Type: Task
>Reporter: Benno Evers
>Priority: Major
>
> If the command executor is using the v1 API (--http_command_executors agent 
> flag) and the MESOS_DOMAIN_SOCKET environment variable is set, the command 
> executor should use the domain socket to communicate with the agent or die 
> trying.





[jira] [Assigned] (MESOS-10059) Let the command executor connect through a domain socket when available

2019-12-02 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-10059:
---

Assignee: Benno Evers

> Let the command executor connect through a domain socket when available
> ---
>
> Key: MESOS-10059
> URL: https://issues.apache.org/jira/browse/MESOS-10059
> Project: Mesos
>  Issue Type: Task
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>
> If the command executor is using the v1 API (--http_command_executors agent 
> flag) and the MESOS_DOMAIN_SOCKET environment variable is set, the command 
> executor should use the domain socket to communicate with the agent or die 
> trying.





[jira] [Created] (MESOS-10059) Let the command executor connect through a domain socket when available

2019-12-02 Thread Benno Evers (Jira)
Benno Evers created MESOS-10059:
---

 Summary: Let the command executor connect through a domain socket 
when available
 Key: MESOS-10059
 URL: https://issues.apache.org/jira/browse/MESOS-10059
 Project: Mesos
  Issue Type: Task
Reporter: Benno Evers


If the command executor is using the v1 API (--http_command_executors agent 
flag) and the MESOS_DOMAIN_SOCKET environment variable is set, the command 
executor should use the domain socket to communicate with the agent or die 
trying.





[jira] [Created] (MESOS-10058) Implement agent support for principals when listening on a domain socket

2019-12-02 Thread Benno Evers (Jira)
Benno Evers created MESOS-10058:
---

 Summary: Implement agent support for principals when listening on 
a domain socket
 Key: MESOS-10058
 URL: https://issues.apache.org/jira/browse/MESOS-10058
 Project: Mesos
  Issue Type: Task
Reporter: Benno Evers


When the agent is listening for incoming connections from executors, we should 
support executor authentication by making sure the correct principal is passed 
to the HTTP handler function.

This also needs a unit test.







[jira] [Assigned] (MESOS-10058) Implement agent support for principals when listening on a domain socket

2019-12-02 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-10058:
---

Assignee: Benjamin Bannier

> Implement agent support for principals when listening on a domain socket
> 
>
> Key: MESOS-10058
> URL: https://issues.apache.org/jira/browse/MESOS-10058
> Project: Mesos
>  Issue Type: Task
>Reporter: Benno Evers
>Assignee: Benjamin Bannier
>Priority: Major
>
> When the agent is listening for incoming connections from executors, we 
> should support executor authentication by making sure the correct principal 
> is passed to the HTTP handler function.
> This also needs a unit test.





[jira] [Assigned] (MESOS-10038) Implement agent code to listen on a domain socket

2019-11-27 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-10038:
---

Assignee: Benjamin Bannier

> Implement agent code to listen on a domain socket
> -
>
> Key: MESOS-10038
> URL: https://issues.apache.org/jira/browse/MESOS-10038
> Project: Mesos
>  Issue Type: Task
>Reporter: Benno Evers
>Assignee: Benjamin Bannier
>Priority: Major
>
> On an agent with executor domain sockets enabled, we need to implement code 
> such that the agent listens for incoming connections on its domain sockets, 
> and creates `Connection` objects through which executor <-> agent v1 
> communication can happen.





[jira] [Assigned] (MESOS-10037) Create code to bind-mount domain sockets into executor containers

2019-11-27 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-10037:
---

Assignee: Benno Evers

> Create code to bind-mount domain sockets into executor containers
> -
>
> Key: MESOS-10037
> URL: https://issues.apache.org/jira/browse/MESOS-10037
> Project: Mesos
>  Issue Type: Task
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>
> On an agent with domain socket communication enabled, when a new executor is 
> launched, the agent should bind-mount the domain socket into the executor's 
> root directory.
> On a failure to create the mount, the task launch should fail with the new 
> reason `REASON_BIND_MOUNT_FAILED`.





[jira] [Assigned] (MESOS-10036) Implement agent code to create a domain socket on startup

2019-11-27 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-10036:
---

Assignee: Benno Evers

> Implement agent code to create a domain socket on startup
> -
>
> Key: MESOS-10036
> URL: https://issues.apache.org/jira/browse/MESOS-10036
> Project: Mesos
>  Issue Type: Task
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>
> To implement the design proposed in 
> https://docs.google.com/document/d/1RUvjoBvM3UX_lLcq_J_crWpMMn3nO8CY0KWc655ELsM/edit
> we need code in the agent that, when domain socket communication is enabled, 
> checks on startup whether a domain socket already exists at the location 
> pointed to by flags.domain_socket_location and, if not, creates a new 
> listening socket bound to that path.




[jira] [Commented] (MESOS-10036) Implement agent code to create a domain socket on startup

2019-11-27 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16983447#comment-16983447
 ] 

Benno Evers commented on MESOS-10036:
-

https://reviews.apache.org/r/71832/
https://reviews.apache.org/r/71833/

> Implement agent code to create a domain socket on startup
> -
>
> Key: MESOS-10036
> URL: https://issues.apache.org/jira/browse/MESOS-10036
> Project: Mesos
>  Issue Type: Task
>Reporter: Benno Evers
>Priority: Major
>
> To implement the design proposed in 
> https://docs.google.com/document/d/1RUvjoBvM3UX_lLcq_J_crWpMMn3nO8CY0KWc655ELsM/edit
> we need code in the agent that, when domain socket communication is enabled, 
> checks on startup whether a domain socket already exists at the location 
> pointed to by MESOS_DOMAIN_SOCKET and, if not, creates a new listening 
> socket bound to that path.





[jira] [Commented] (MESOS-10035) Implement `enable_http_executor_domain_sockets` agent flag

2019-11-25 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16981704#comment-16981704
 ] 

Benno Evers commented on MESOS-10035:
-

https://reviews.apache.org/r/71816/

> Implement `enable_http_executor_domain_sockets` agent flag
> --
>
> Key: MESOS-10035
> URL: https://issues.apache.org/jira/browse/MESOS-10035
> Project: Mesos
>  Issue Type: Task
>Reporter: Benno Evers
>Assignee: Benjamin Bannier
>Priority: Major
>
> Based on the design in 
> https://docs.google.com/document/d/1RUvjoBvM3UX_lLcq_J_crWpMMn3nO8CY0KWc655ELsM/edit
>  we need a `--enable_http_executor_domain_sockets[=true|false]` flag for the 
> mesos agent.
> The basic functionality we'd like for this task is, in pseudocode:
> {noformat}
> DURING task launch
>     IF launching new executor && enable_http_executor_domain_sockets == True:
>         Inject MESOS_DOMAIN_SOCKET environment variable pointing to `/agent.sock`
> {noformat}
>  
> Setting the environment variable can be done in the 
> `slave.cpp:executorEnvironment()` function.
> The code that actually creates the socket and puts it into the location 
> pointed to by `MESOS_DOMAIN_SOCKET` will be implemented in a separate ticket.
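
The pseudocode above can be sketched in Python as follows. This only illustrates the control flow; the real change would live in the C++ function `slave.cpp:executorEnvironment()`. The function name `executor_environment` and the `socket_dir` parameter are made up, because the ticket only names the file `agent.sock` and not the directory it lives in.

```python
import os.path


def executor_environment(base_env, is_new_executor, enable_domain_sockets, socket_dir):
    """Return the executor environment, adding MESOS_DOMAIN_SOCKET if enabled.

    `socket_dir` is a hypothetical parameter for the directory holding
    `agent.sock`.
    """
    env = dict(base_env)  # never mutate the caller's environment
    if is_new_executor and enable_domain_sockets:
        env["MESOS_DOMAIN_SOCKET"] = os.path.join(socket_dir, "agent.sock")
    return env
```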





[jira] [Commented] (MESOS-10039) Let the built-in executors connect through a domain socket when available

2019-11-25 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16981637#comment-16981637
 ] 

Benno Evers commented on MESOS-10039:
-

https://reviews.apache.org/r/71814/
https://reviews.apache.org/r/71815/

> Let the built-in executors connect through a domain socket when available
> -
>
> Key: MESOS-10039
> URL: https://issues.apache.org/jira/browse/MESOS-10039
> Project: Mesos
>  Issue Type: Task
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>
> We should implement code in the default executor that checks for the presence 
> of the `MESOS_DOMAIN_SOCKET` environment variable, and if it is set attempts 
> to use that for communication with the agent as opposed to trying to open a 
> TCP connection.
> The same should be done for the command executor, when it is using the v1 API 
> to connect to the agent.





[jira] [Assigned] (MESOS-10039) Let the built-in executors connect through a domain socket when available

2019-11-25 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-10039:
---

Assignee: Benno Evers

> Let the built-in executors connect through a domain socket when available
> -
>
> Key: MESOS-10039
> URL: https://issues.apache.org/jira/browse/MESOS-10039
> Project: Mesos
>  Issue Type: Task
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>
> We should implement code in the default executor that checks for the presence 
> of the `MESOS_DOMAIN_SOCKET` environment variable, and if it is set attempts 
> to use that for communication with the agent as opposed to trying to open a 
> TCP connection.
> The same should be done for the command executor, when it is using the v1 API 
> to connect to the agent.





[jira] [Assigned] (MESOS-10035) Implement `enable_http_executor_domain_sockets` agent flag

2019-11-22 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-10035:
---

  Sprint: Studio 4: RI-21 60
Story Points: 5
Assignee: Benjamin Bannier
 Description: 
Based on the design in 
https://docs.google.com/document/d/1RUvjoBvM3UX_lLcq_J_crWpMMn3nO8CY0KWc655ELsM/edit
 we need a `--enable_http_executor_domain_sockets[=true|false]` flag for the 
mesos agent.

The basic functionality we'd like for this task is, in pseudocode:
{noformat}
DURING task launch
    IF launching new executor && enable_http_executor_domain_sockets == True:
        Inject MESOS_DOMAIN_SOCKET environment variable pointing to `/agent.sock`
{noformat}
 
Setting the environment variable can be done in the 
`slave.cpp:executorEnvironment()` function.

The code that actually creates the socket and puts it into the location pointed 
to by `MESOS_DOMAIN_SOCKET` will be implemented in a separate ticket.

  was:
Based on the design in 
https://docs.google.com/document/d/1RUvjoBvM3UX_lLcq_J_crWpMMn3nO8CY0KWc655ELsM/edit
 we need a `--enable_http_executor_domain_sockets[=true|false]` flag for the 
mesos agent.

The basic functionality we'd like for this task is, in pseudocode:
{noformat}
DURING task launch
    IF launching new executor && enable_http_executor_domain_sockets == True:
        Inject MESOS_DOMAIN_SOCKET environment variable pointing to `/agent.sock`
{noformat}
 
The code that actually creates the socket and puts it into the location pointed 
to by `MESOS_DOMAIN_SOCKET` will be implemented in a separate ticket.


> Implement `enable_http_executor_domain_sockets` agent flag
> --
>
> Key: MESOS-10035
> URL: https://issues.apache.org/jira/browse/MESOS-10035
> Project: Mesos
>  Issue Type: Task
>Reporter: Benno Evers
>Assignee: Benjamin Bannier
>Priority: Major
>
> Based on the design in 
> https://docs.google.com/document/d/1RUvjoBvM3UX_lLcq_J_crWpMMn3nO8CY0KWc655ELsM/edit
>  we need a `--enable_http_executor_domain_sockets[=true|false]` flag for the 
> mesos agent.
> The basic functionality we'd like for this task is, in pseudocode:
> {noformat}
> DURING task launch
>     IF launching new executor && enable_http_executor_domain_sockets == True:
>         Inject MESOS_DOMAIN_SOCKET environment variable pointing to `/agent.sock`
> {noformat}
>  
> Setting the environment variable can be done in the 
> `slave.cpp:executorEnvironment()` function.
> The code that actually creates the socket and puts it into the location 
> pointed to by `MESOS_DOMAIN_SOCKET` will be implemented in a separate ticket.





[jira] [Created] (MESOS-10039) Let the built-in executors connect through a domain socket when available

2019-11-21 Thread Benno Evers (Jira)
Benno Evers created MESOS-10039:
---

 Summary: Let the built-in executors connect through a domain 
socket when available
 Key: MESOS-10039
 URL: https://issues.apache.org/jira/browse/MESOS-10039
 Project: Mesos
  Issue Type: Task
Reporter: Benno Evers


We should implement code in the default executor that checks for the presence 
of the `MESOS_DOMAIN_SOCKET` environment variable and, if it is set, attempts 
to use it for communication with the agent instead of opening a TCP 
connection.

The same should be done for the command executor, when it is using the v1 API 
to connect to the agent.
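
The check described above could be sketched roughly as follows. This is an illustration in Python rather than the executor's actual C++ code; the function name `connect_to_agent` is hypothetical, and falling back to the `MESOS_AGENT_ENDPOINT` variable (assumed to hold `ip:port`) for the TCP case is an assumption of this sketch.

```python
import socket
from typing import Mapping


def connect_to_agent(env: Mapping[str, str]) -> socket.socket:
    """Connect via the domain socket if MESOS_DOMAIN_SOCKET is set, else via TCP."""
    path = env.get("MESOS_DOMAIN_SOCKET")
    if path:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
        except OSError:
            # Assumption: no silent fallback to TCP; the error is propagated.
            sock.close()
            raise
        return sock

    # Assumption: the TCP endpoint is given as "ip:port".
    host, _, port = env["MESOS_AGENT_ENDPOINT"].rpartition(":")
    return socket.create_connection((host, int(port)))
```

Whether a failed domain socket connection should fall back to TCP is not stated in the ticket; the sketch simply raises.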





[jira] [Created] (MESOS-10038) Implement agent code to listen on a domain socket

2019-11-21 Thread Benno Evers (Jira)
Benno Evers created MESOS-10038:
---

 Summary: Implement agent code to listen on a domain socket
 Key: MESOS-10038
 URL: https://issues.apache.org/jira/browse/MESOS-10038
 Project: Mesos
  Issue Type: Task
Reporter: Benno Evers


On an agent with executor domain sockets enabled, we need to implement code 
such that the agent listens for incoming connections on its domain sockets, and 
creates `Connection` objects through which executor <-> agent v1 communication 
can happen.
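
As a rough illustration of the accept side, the following Python sketch wraps each accepted client in a small `Connection` class. The class here is only a hypothetical stand-in for the `Connection` type mentioned above, and `accept_connections` is an invented helper name; the real agent code is C++ and will look different.

```python
import socket
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Connection:
    """Hypothetical stand-in for the `Connection` type named in the ticket."""
    sock: socket.socket

    def send(self, data: bytes) -> None:
        self.sock.sendall(data)

    def recv(self, size: int = 4096) -> bytes:
        return self.sock.recv(size)


def accept_connections(listener: socket.socket, limit: int) -> Iterator[Connection]:
    """Accept up to `limit` clients on an already-listening domain socket."""
    for _ in range(limit):
        client, _addr = listener.accept()
        yield Connection(client)
```

The `limit` parameter exists only to keep the sketch finite; a real agent would accept connections for as long as it runs.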





[jira] [Created] (MESOS-10037) Create code to bind-mount domain sockets into executor containers

2019-11-21 Thread Benno Evers (Jira)
Benno Evers created MESOS-10037:
---

 Summary: Create code to bind-mount domain sockets into executor 
containers
 Key: MESOS-10037
 URL: https://issues.apache.org/jira/browse/MESOS-10037
 Project: Mesos
  Issue Type: Task
Reporter: Benno Evers


On an agent with domain socket communication enabled, when a new executor is 
launched, the agent should bind-mount the domain socket into the executor's 
root directory.





[jira] [Created] (MESOS-10036) Implement agent code to create a domain socket on startup

2019-11-21 Thread Benno Evers (Jira)
Benno Evers created MESOS-10036:
---

 Summary: Implement agent code to create a domain socket on startup
 Key: MESOS-10036
 URL: https://issues.apache.org/jira/browse/MESOS-10036
 Project: Mesos
  Issue Type: Task
Reporter: Benno Evers


When implementing the design proposed in 
https://docs.google.com/document/d/1RUvjoBvM3UX_lLcq_J_crWpMMn3nO8CY0KWc655ELsM/edit
 , in the case where we enable domain socket communication, we need some code 
in the agent that checks on startup whether a domain socket already exists at 
the location pointed to by MESOS_DOMAIN_SOCKET and, if not, creates a new 
listening socket bound to that path.
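
A minimal sketch of that startup check, written in Python for brevity: the helper name `ensure_domain_socket` and the choice to return `None` when a socket file already exists are assumptions of this example, not part of the design.

```python
import os
import socket
import stat
from typing import Optional


def ensure_domain_socket(path: str, backlog: int = 16) -> Optional[socket.socket]:
    """Create a listening Unix socket at `path` unless a socket is already there."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        mode = None

    if mode is not None:
        if stat.S_ISSOCK(mode):
            # A socket already exists; this sketch leaves it untouched.
            return None
        raise FileExistsError(f"{path} exists and is not a socket")

    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)       # creates the socket file
        sock.listen(backlog)
    except OSError:
        sock.close()
        raise
    return sock
```

How an existing but stale socket file should be handled is left open here; the ticket only describes the "does not exist yet" branch.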





[jira] [Created] (MESOS-10035) Implement `enable_http_executor_domain_sockets` agent flag

2019-11-21 Thread Benno Evers (Jira)
Benno Evers created MESOS-10035:
---

 Summary: Implement `enable_http_executor_domain_sockets` agent flag
 Key: MESOS-10035
 URL: https://issues.apache.org/jira/browse/MESOS-10035
 Project: Mesos
  Issue Type: Task
Reporter: Benno Evers


Based on the design in 
https://docs.google.com/document/d/1RUvjoBvM3UX_lLcq_J_crWpMMn3nO8CY0KWc655ELsM/edit
 we need:

 * A `--enable_http_executor_domain_sockets[=true|false]` flag
 * An `optional StringValue mesos_socket_location` field in the `ContainerInfo` 
protobuf

(note: the last one is still under discussion and has a high chance of being 
dropped from the final design, so ideally the commits will be structured such 
that it can easily be dropped from the code as well)

The basic functionality we'd like for this task is, in pseudocode:
{noformat}
DURING task launch
    IF launching new executor && enable_http_executor_domain_sockets == True &&
       !domain sockets disabled for this executor:
        Inject MESOS_DOMAIN_SOCKET environment variable pointing to `/agent.sock`
        or `mesos_socket_location` into the executor environment.
{noformat}
 
The code that actually creates the socket and puts it into the location pointed 
to by `MESOS_DOMAIN_SOCKET` will be implemented in a separate ticket.
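
For illustration only, the pseudocode above might translate into something like the following Python sketch. The function and parameter names are hypothetical, and the default path is copied as written in this ticket; the real change would live in the agent's C++ code.

```python
from typing import Dict, Optional

# Path as written in the ticket; the real location may differ.
DEFAULT_SOCKET_PATH = "/agent.sock"


def executor_environment(
    base_env: Dict[str, str],
    enable_http_executor_domain_sockets: bool,
    launching_new_executor: bool,
    mesos_socket_location: Optional[str] = None,
    disabled_for_executor: bool = False,
) -> Dict[str, str]:
    """Return a copy of base_env, adding MESOS_DOMAIN_SOCKET when applicable."""
    env = dict(base_env)
    if (
        launching_new_executor
        and enable_http_executor_domain_sockets
        and not disabled_for_executor
    ):
        # Prefer the per-container location if one was given.
        env["MESOS_DOMAIN_SOCKET"] = mesos_socket_location or DEFAULT_SOCKET_PATH
    return env
```

Returning a copy rather than mutating the input keeps the flag check easy to test in isolation.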





[jira] [Created] (MESOS-10034) Agent/executor domain socket communication

2019-11-21 Thread Benno Evers (Jira)
Benno Evers created MESOS-10034:
---

 Summary: Agent/executor domain socket communication
 Key: MESOS-10034
 URL: https://issues.apache.org/jira/browse/MESOS-10034
 Project: Mesos
  Issue Type: Epic
Reporter: Benno Evers


Enable executors to communicate with Mesos agents via unix domain sockets.





[jira] [Commented] (MESOS-9987) Update 'Master::Http::_reserve' to also require 'source' resources

2019-11-08 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970179#comment-16970179
 ] 

Benno Evers commented on MESOS-9987:


{noformat}
commit b368d897d83df2f261e01fa7583798d80d098052
Author: Benno Evers 
Date:   Fri Nov 8 14:06:16 2019 +0100

Updated 'Master::Http::_reserve' to pass along new 'source' field.

Updated 'Master::Http::_reserve()' to correctly set the new `source`
field in the `Offer::Operation` created from operator API input.

Review: https://reviews.apache.org/r/71695/
{noformat}

> Update 'Master::Http::_reserve' to also require 'source' resources
> --
>
> Key: MESOS-9987
> URL: https://issues.apache.org/jira/browse/MESOS-9987
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Major
>  Labels: foundations
> Fix For: 1.10
>
>
> We need to always pass {{source}} into {{Master::Http::_reserve}}.





[jira] [Commented] (MESOS-9986) Update 'getConsumedResources' and 'getResourceConversions' for 'source' in reservations

2019-11-08 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970175#comment-16970175
 ] 

Benno Evers commented on MESOS-9986:


{noformat}
commit 1d225b4c0270f06b901f0fafd777a347aae921cd
Author: Benno Evers 
Date:   Fri Nov 8 14:19:11 2019 +0100

Updated 'getResourceConversion()' for reservation updates.

Updated the `getResourcesConversion()` function to correctly
handle the `source` field in `RESERVE` operations.

Review: https://reviews.apache.org/r/71719/
{noformat}

> Update 'getConsumedResources' and 'getResourceConversions' for 'source' in 
> reservations
> ---
>
> Key: MESOS-9986
> URL: https://issues.apache.org/jira/browse/MESOS-9986
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Major
>






[jira] [Commented] (MESOS-9991) Update 'Master::authorizeReserveResources' for re-reservations

2019-11-08 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970167#comment-16970167
 ] 

Benno Evers commented on MESOS-9991:


{noformat}
commit 09c830d87b88d4c2f386cb9ded5931528d6cf144
Author: Benjamin Bannier 
Date:   Fri Nov 8 14:19:16 2019 +0100

Added authorization handling for reservations with `source`.

This patch adds authorization handling for `RESERVE` operations
containing `source` fields. In order to stay backwards-compatible we add
a dedicated authorization branch for such operations which under the
hood translates each removed reservation to an `UNRESERVE` operation and
every added reservation as a `RESERVE` operation where we fall back to
existing authorization code for authorization.

Review: https://reviews.apache.org/r/71729/
{noformat}

> Update 'Master::authorizeReserveResources' for re-reservations
> --
>
> Key: MESOS-9991
> URL: https://issues.apache.org/jira/browse/MESOS-9991
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>Priority: Major
>  Labels: foundations
>
> We need to authorize all modifications to bring {{source}} to common 
> ancestor, and from common ancestor to {{resources}}.
>  * each removed reservation needs to be authorized as an {{unreserve}} 
> operation
>  * each added reservation needs to be authorized as a {{reserve}} operation
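
The translation into individual authorization steps can be sketched as below. This is a simplified Python model, not the master's code: a reservation stack is represented as a plain list of role names, and `authorization_steps` is a hypothetical helper name.

```python
from typing import List, Sequence, Tuple


def authorization_steps(
    source: Sequence[str], resources: Sequence[str]
) -> List[Tuple[str, str]]:
    """List the UNRESERVE/RESERVE steps needed to go from source to resources."""
    # The common ancestor is modeled as the longest common prefix.
    common = 0
    for s, r in zip(source, resources):
        if s != r:
            break
        common += 1

    # Reservations are removed innermost first, then the new ones are added.
    steps = [("UNRESERVE", role) for role in reversed(source[common:])]
    steps += [("RESERVE", role) for role in resources[common:]]
    return steps


print(authorization_steps(["eng", "eng/web"], ["eng", "eng/db"]))
# [('UNRESERVE', 'eng/web'), ('RESERVE', 'eng/db')]
```

Each resulting step would then be handed to the existing authorization code for that operation type.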





[jira] [Commented] (MESOS-9992) Add end-to-end test exercising re-reservation operator API

2019-11-08 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970165#comment-16970165
 ] 

Benno Evers commented on MESOS-9992:


{noformat}
commit b6bdc74c896303dc1775c68642023ee4513834b1 (HEAD -> master, origin/master)
Author: Benno Evers 
Date:   Fri Nov 8 14:19:22 2019 +0100

Added end-to-end test for operator API reservation updates.

Added a new test to verify that reservations can be updated
using the operator API.

Review: https://reviews.apache.org/r/71725/
{noformat}

> Add end-to-end test exercising re-reservation operator API
> ---
>
> Key: MESOS-9992
> URL: https://issues.apache.org/jira/browse/MESOS-9992
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Major
>  Labels: foundations
>






[jira] [Assigned] (MESOS-9992) Add end-to-end test exercising re-reservation operator API

2019-11-06 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9992:
--

Assignee: Benno Evers

> Add end-to-end test exercising re-reservation operator API
> ---
>
> Key: MESOS-9992
> URL: https://issues.apache.org/jira/browse/MESOS-9992
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Major
>






[jira] [Assigned] (MESOS-9986) Update 'getConsumedResources' and 'getResourceConversions' for 'source' in reservations

2019-11-01 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9986:
--

Assignee: Benno Evers

> Update 'getConsumedResources' and 'getResourceConversions' for 'source' in 
> reservations
> ---
>
> Key: MESOS-9986
> URL: https://issues.apache.org/jira/browse/MESOS-9986
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Major
>






[jira] [Assigned] (MESOS-9990) Consolidate 'Master::authorizeReserveResources' overloads

2019-10-29 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9990:
--

Assignee: Benno Evers

> Consolidate 'Master::authorizeReserveResources' overloads
> -
>
> Key: MESOS-9990
> URL: https://issues.apache.org/jira/browse/MESOS-9990
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Major
>
> We should remove {{Master::authorizeReserveResources(Resources, 
> Option)}} in favor of {{Master::authorizeReserveResources(Reserve, 
> Option)}}.





[jira] [Assigned] (MESOS-9985) Update validation of 'ReserveResources' for 'source'

2019-10-29 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9985:
--

Assignee: Benno Evers

> Update validation of 'ReserveResources' for 'source'
> 
>
> Key: MESOS-9985
> URL: https://issues.apache.org/jira/browse/MESOS-9985
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Major
>
> We need to update {{master::validation::master::call}} for {{source}}. In 
> particular we need to require that {{source}} and {{resources}} have a common 
> ancestor.





[jira] [Assigned] (MESOS-9987) Update 'Master::Http::_reserve' to also require 'source' resources

2019-10-29 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9987:
--

Assignee: Benno Evers

> Update 'Master::Http::_reserve' to also require 'source' resources
> --
>
> Key: MESOS-9987
> URL: https://issues.apache.org/jira/browse/MESOS-9987
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Major
>
> We need to always pass {{source}} into {{Master::Http::_reserve}}.





[jira] [Assigned] (MESOS-9989) Update 'Master::Http::_reserve' to pass 'source' into generated operation

2019-10-29 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9989:
--

Assignee: Benno Evers

> Update 'Master::Http::_reserve' to pass 'source' into generated operation
> -
>
> Key: MESOS-9989
> URL: https://issues.apache.org/jira/browse/MESOS-9989
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Major
>






[jira] [Assigned] (MESOS-9984) Provide a function to compute a common "reservation ancestor" between two 'Resources'

2019-10-28 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9984:
--

Assignee: Benno Evers

> Provide a function to compute a common "reservation ancestor" between two 
> 'Resources'
> -
>
> Key: MESOS-9984
> URL: https://issues.apache.org/jira/browse/MESOS-9984
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Major
>
> We need to provide a function to compute a common "reservation ancestor" 
> between two resources, {{Try getReservationAncestor(const 
> Resources&, const Resources&)}}.
> The common ancestor can be found by repeatedly popping dynamic reservations 
> from the full {{Resources}}.
> We should test the following cases:
>  * either LHS or RHS empty
>  * both empty -> empty ancestor
>  * {{STATIC}} reservations on path
>  * partially reserved LHS/RHS (partially reserved: not all {{Resource}} have 
> the same reservation).
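
The "repeatedly popping" idea can be sketched in Python under a strongly simplified model: each side is a single reservation stack of `(role, kind)` pairs, which ignores the partially reserved case listed above. The names `reservation_ancestor` and `_pop_dynamic` are hypothetical, and treating a needed pop of a `STATIC` reservation as an error is an assumption of this sketch.

```python
from typing import List, Sequence, Tuple

Reservation = Tuple[str, str]  # (role, "STATIC" or "DYNAMIC")


def _pop_dynamic(stack: List[Reservation]) -> None:
    """Remove the innermost reservation, which must be dynamic."""
    role, kind = stack[-1]
    if kind != "DYNAMIC":
        raise ValueError(f"cannot pop non-dynamic reservation for role {role!r}")
    stack.pop()


def reservation_ancestor(
    lhs: Sequence[Reservation], rhs: Sequence[Reservation]
) -> List[Reservation]:
    """Pop dynamic reservations from both sides until the stacks are equal."""
    left, right = list(lhs), list(rhs)
    while left != right:
        if len(left) > len(right):
            _pop_dynamic(left)
        elif len(right) > len(left):
            _pop_dynamic(right)
        else:
            # Same depth but different content: both sides must shrink.
            _pop_dynamic(left)
            _pop_dynamic(right)
    return left
```

With both inputs empty the loop never runs and the result is the empty ancestor, matching the "both empty" test case listed in the ticket.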





[jira] [Assigned] (MESOS-9988) Add 'source' field to scheduler reservation API

2019-10-28 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9988:
--

Assignee: Benno Evers

> Add 'source' field to scheduler reservation API
> ---
>
> Key: MESOS-9988
> URL: https://issues.apache.org/jira/browse/MESOS-9988
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Major
>






[jira] [Assigned] (MESOS-9983) Intermediate rejection of Reserve operations with source set

2019-10-25 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9983:
--

Assignee: Benno Evers

> Intermediate rejection of Reserve operations with source set
> 
>
> Key: MESOS-9983
> URL: https://issues.apache.org/jira/browse/MESOS-9983
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Major
>
> We need to update {{Master::authorizeReserveResources}} to reject any 
> {{Reserve}} operation whenever {{source}} is set until we have a proper 
> implementation in place.





[jira] [Assigned] (MESOS-9982) Add a 'source' field to operator API ReserveResources protobuf

2019-10-25 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9982:
--

Assignee: Benno Evers

> Add a 'source' field to operator API ReserveResources protobuf
> --
>
> Key: MESOS-9982
> URL: https://issues.apache.org/jira/browse/MESOS-9982
> Project: Mesos
>  Issue Type: Task
>  Components: HTTP API
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Major
>






[jira] [Commented] (MESOS-9972) Update Names for TLS-related environment variables in libprocess.

2019-09-20 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934655#comment-16934655
 ] 

Benno Evers commented on MESOS-9972:


https://reviews.apache.org/r/71497/

[master]
{noformat}
commit 9f1d38f491e8d9c02bebb094da87467bb70a8d27
Author: Benno Evers 
Date:   Tue Sep 17 14:04:35 2019 +0200

Introduced new names for SSL-related libprocess flags.

The `LIBPROCESS_SSL_REQUIRE_CERT` flag was renamed to
`LIBPROCESS_SSL_REQUIRE_CLIENT_CERT`.

The `LIBPROCESS_SSL_VERIFY_CERT` flag was renamed to
`LIBPROCESS_SSL_VERIFY_SERVER_CERT`.

The new names better describe the actual effect of both flags, and
make upgrades easier by allowing operators to only enable verification
on agents that are new enough to contain the updated hostname
validation code paths.

Review: https://reviews.apache.org/r/71497
{noformat}

[1.9]
{noformat}
commit a8325853a01c2dd597fabe84c437ecfd46fb9c0c
Author: Benno Evers 
Date:   Tue Sep 17 14:04:35 2019 +0200

Introduced new names for SSL-related libprocess flags.

The `LIBPROCESS_SSL_REQUIRE_CERT` flag was renamed to
`LIBPROCESS_SSL_REQUIRE_CLIENT_CERT`.

The `LIBPROCESS_SSL_VERIFY_CERT` flag was renamed to
`LIBPROCESS_SSL_VERIFY_SERVER_CERT`.

The new names better describe the actual effect of both flags, and
make upgrades easier by allowing operators to only enable verification
on agents that are new enough to contain the updated hostname
validation code paths.

Review: https://reviews.apache.org/r/71497
{noformat}

> Update Names for TLS-related environment variables in libprocess.
> -
>
> Key: MESOS-9972
> URL: https://issues.apache.org/jira/browse/MESOS-9972
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>  Labels: libprocess, ssl, tls
> Fix For: 1.10, 1.9.1
>
>
> The environment variables `LIBPROCESS_SSL_VERIFY_CERT` and 
> `LIBPROCESS_SSL_REQUIRE_CERT` regularly cause confusion because they do not 
> precisely describe their function.
> In particular, one might mistakenly assume that certificates are not required 
> when setting `LIBPROCESS_SSL_REQUIRE_CERT=false`, or that all certificates 
> are verified when `LIBPROCESS_SSL_VERIFY_CERT=true`.
> We should rename the options to `LIBPROCESS_SSL_VERIFY_SERVER_CERT` and 
> `LIBPROCESS_SSL_REQUIRE_CLIENT_CERT` to make the semantics more clear.





[jira] [Assigned] (MESOS-9972) Update Names for TLS-related environment variables in libprocess.

2019-09-20 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9972:
--

Fix Version/s: 1.9.1
   1.10
 Assignee: Benno Evers
   Resolution: Fixed

> Update Names for TLS-related environment variables in libprocess.
> -
>
> Key: MESOS-9972
> URL: https://issues.apache.org/jira/browse/MESOS-9972
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>  Labels: libprocess, ssl, tls
> Fix For: 1.10, 1.9.1
>
>
> The environment variables `LIBPROCESS_SSL_VERIFY_CERT` and 
> `LIBPROCESS_SSL_REQUIRE_CERT` regularly cause confusion because they do not 
> precisely describe their function.
> In particular, one might mistakenly assume that certificates are not required 
> when setting `LIBPROCESS_SSL_REQUIRE_CERT=false`, or that all certificates 
> are verified when `LIBPROCESS_SSL_VERIFY_CERT=true`.
> We should rename the options to `LIBPROCESS_SSL_VERIFY_SERVER_CERT` and 
> `LIBPROCESS_SSL_REQUIRE_CLIENT_CERT` to make the semantics more clear.





[jira] [Created] (MESOS-9973) Remove Deprecated Names for libprocess TLS flags

2019-09-18 Thread Benno Evers (Jira)
Benno Evers created MESOS-9973:
--

 Summary: Remove Deprecated Names for libprocess TLS flags
 Key: MESOS-9973
 URL: https://issues.apache.org/jira/browse/MESOS-9973
 Project: Mesos
  Issue Type: Task
Reporter: Benno Evers


The names `LIBPROCESS_SSL_VERIFY_CERT` and `LIBPROCESS_SSL_REQUIRE_CERT` will 
become deprecated when https://reviews.apache.org/r/71497 lands.

They should be removed at some point.

NOTE: This ticket exists only to satisfy bureaucracy; I don't think we should 
actually remove the old names when we release Mesos 2.0.





[jira] [Created] (MESOS-9972) Update Names for TLS-related environment variables in libprocess.

2019-09-18 Thread Benno Evers (Jira)
Benno Evers created MESOS-9972:
--

 Summary: Update Names for TLS-related environment variables in 
libprocess.
 Key: MESOS-9972
 URL: https://issues.apache.org/jira/browse/MESOS-9972
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


The environment variables `LIBPROCESS_SSL_VERIFY_CERT` and 
`LIBPROCESS_SSL_REQUIRE_CERT` regularly cause confusion because they do not 
precisely describe their function.

In particular, one might mistakenly assume that certificates are not required 
when setting `LIBPROCESS_SSL_REQUIRE_CERT=false`, or that all certificates are 
verified when `LIBPROCESS_SSL_VERIFY_CERT=true`.

We should rename the options to `LIBPROCESS_SSL_VERIFY_SERVER_CERT` and 
`LIBPROCESS_SSL_REQUIRE_CLIENT_CERT` to make the semantics more clear.
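
One way to picture the rename, assuming the old names remain accepted as deprecated aliases, is the Python sketch below. The helper `read_flag` is hypothetical, and giving the new name precedence when both are set is an assumption of this example rather than documented libprocess behaviour.

```python
from typing import Mapping, Optional

# New name -> old name, as proposed in this ticket.
RENAMED = {
    "LIBPROCESS_SSL_VERIFY_SERVER_CERT": "LIBPROCESS_SSL_VERIFY_CERT",
    "LIBPROCESS_SSL_REQUIRE_CLIENT_CERT": "LIBPROCESS_SSL_REQUIRE_CERT",
}


def read_flag(env: Mapping[str, str], name: str) -> Optional[str]:
    """Look up a flag by its new name, falling back to the old name."""
    if name in env:
        return env[name]
    # Assumption: the old name is only consulted when the new one is unset.
    return env.get(RENAMED[name])
```

For example, `read_flag({"LIBPROCESS_SSL_VERIFY_CERT": "true"}, "LIBPROCESS_SSL_VERIFY_SERVER_CERT")` returns `"true"` in this model.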





[jira] [Commented] (MESOS-9879) Create a unit test ensuring that client certificate requests are properly ignored

2019-09-17 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931390#comment-16931390
 ] 

Benno Evers commented on MESOS-9879:


Given that the behaviour described here is mandated by the TLS spec and testing 
it would require implementing a custom, buggy TLS implementation, I think it's 
safe to say the costs outweigh the benefits here. Closing this as "Won't Fix".

> Create a unit test ensuring that client certificate requests are properly 
> ignored
> ---
>
> Key: MESOS-9879
> URL: https://issues.apache.org/jira/browse/MESOS-9879
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Priority: Major
>  Labels: libprocess, ssl, tls
>
> When a TLS server sends a Client Certificate Request as part of the handshake 
> and the client does not have a certificate available, the TLS specification 
> mandates that the client shall attempt to continue the connection attempt 
> sending a zero-length certificate.
> We should write a unit test verifying libprocess handles this correctly when 
> acting as a client, although it's not completely clear how this might be 
> implemented.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (MESOS-9960) Agent with cgroup support may destroy containers belonging to unrelated agents on startup

2019-09-12 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928365#comment-16928365
 ] 

Benno Evers commented on MESOS-9960:


[~gilbert], I'm not sure I follow, why do you think it should be closed?

> Agent with cgroup support may destroy containers belonging to unrelated 
> agents on startup
> -
>
> Key: MESOS-9960
> URL: https://issues.apache.org/jira/browse/MESOS-9960
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.8.1, 1.9.0, master
>Reporter: Benno Evers
>Priority: Major
>
> Let's say I have a mesos cluster with one master and one agent:
> {noformat}
> $ mesos-master --work_dir=/tmp/mesos-master
> $ sudo mesos-agent --work_dir=/tmp/mesos-agent --master=127.0.1.1:5050 
> --port=5052 --isolation=docker/runtime
> {noformat}
> where I'm running a simple sleep task:
> {noformat}
> $ mesos-execute --command="sleep 1" --master=127.0.1.1:5050 --name="sleep"
> I0904 18:40:25.020413 18321 scheduler.cpp:189] Version: 1.8.0
> I0904 18:40:25.020892 18319 scheduler.cpp:342] Using default 'basic' HTTP 
> authenticatee
> I0904 18:40:25.021039 18323 scheduler.cpp:525] New master detected at 
> master@127.0.1.1:5050
> Subscribed with ID 7d9f5030-cadd-49df-bf1e-daa97a4baab6-
> Submitted task 'sleep' to agent 'd59e934c-9e26-490d-9f4a-1e8b4ce06b4e-S1'
> Received status update TASK_STARTING for task 'sleep'
>   source: SOURCE_EXECUTOR
> Received status update TASK_RUNNING for task 'sleep'
>   source: SOURCE_EXECUTOR
> {noformat}
> Next, I start a second agent on the same host as the first one:
> {noformat}
> $ sudo ./src/mesos-agent --work_dir=/tmp/ --master=example.org:5050 
> --isolation="linux/seccomp" 
> --seccomp_config_dir=`pwd`/3rdparty/libseccomp-2.3.3
> {noformat}
> During startup, this agent detects the container belonging to the other, 
> unrelated agent and will attempt to clean it up:
> {noformat}
> I0904 18:30:44.906430 18067 task_status_update_manager.cpp:207] Recovering 
> task status update manager
> I0904 18:30:44.906913 18071 containerizer.cpp:797] Recovering Mesos containers
> I0904 18:30:44.910077 18070 linux_launcher.cpp:286] Recovering Linux launcher
> I0904 18:30:44.910347 18070 linux_launcher.cpp:343] Recovered container 
> 7f455ed7-6593-41e8-9b29-52ee84d7675b
> I0904 18:30:44.910409 18070 linux_launcher.cpp:437] 
> 7f455ed7-6593-41e8-9b29-52ee84d7675b is a known orphaned container
> I0904 18:30:44.910877 18065 containerizer.cpp:1123] Recovering isolators
> I0904 18:30:44.911888 18064 containerizer.cpp:1162] Recovering provisioner
> I0904 18:30:44.913368 18068 provisioner.cpp:498] Provisioner recovery complete
> I0904 18:30:44.913630 18065 containerizer.cpp:1234] Cleaning up orphan 
> container 7f455ed7-6593-41e8-9b29-52ee84d7675b
> I0904 18:30:44.913656 18065 containerizer.cpp:2576] Destroying container 
> 7f455ed7-6593-41e8-9b29-52ee84d7675b in RUNNING state
> I0904 18:30:44.913666 18065 containerizer.cpp:3278] Transitioning the state 
> of container 7f455ed7-6593-41e8-9b29-52ee84d7675b from RUNNING to DESTROYING
> I0904 18:30:44.914687 18064 linux_launcher.cpp:576] Asked to destroy 
> container 7f455ed7-6593-41e8-9b29-52ee84d7675b
> I0904 18:30:44.914788 18064 linux_launcher.cpp:618] Destroying cgroup 
> '/sys/fs/cgroup/freezer/mesos/7f455ed7-6593-41e8-9b29-52ee84d7675b'
> {noformat}
> killing the sleep task in the process:
> {noformat}
> Received status update TASK_FAILED for task 'sleep'
>   message: 'Executor terminated'
>   source: SOURCE_AGENT
>   reason: REASON_EXECUTOR_TERMINATED
> {noformat}
> After some additional testing, it seems like the value of the `--isolation` 
> flag is actually irrelevant: The same behaviour can be observed as long as 
> cgroup support is enabled with `--systemd_enable_support`.





[jira] [Created] (MESOS-9960) Agent with cgroup support may destroy containers belonging to unrelated agents on startup

2019-09-04 Thread Benno Evers (Jira)
Benno Evers created MESOS-9960:
--

 Summary: Agent with cgroup support may destroy containers 
belonging to unrelated agents on startup
 Key: MESOS-9960
 URL: https://issues.apache.org/jira/browse/MESOS-9960
 Project: Mesos
  Issue Type: Bug
Affects Versions: master, 1.8.1, 1.9.0
Reporter: Benno Evers


Let's say I have a mesos cluster with one master and one agent:
{noformat}
$ mesos-master --work_dir=/tmp/mesos-master
$ sudo mesos-agent --work_dir=/tmp/mesos-agent --master=127.0.1.1:5050 
--port=5052 --isolation=docker/runtime
{noformat}

where I'm running a simple sleep task:
{noformat}
$ mesos-execute --command="sleep 1" --master=127.0.1.1:5050 --name="sleep"
I0904 18:40:25.020413 18321 scheduler.cpp:189] Version: 1.8.0
I0904 18:40:25.020892 18319 scheduler.cpp:342] Using default 'basic' HTTP 
authenticatee
I0904 18:40:25.021039 18323 scheduler.cpp:525] New master detected at 
master@127.0.1.1:5050
Subscribed with ID 7d9f5030-cadd-49df-bf1e-daa97a4baab6-
Submitted task 'sleep' to agent 'd59e934c-9e26-490d-9f4a-1e8b4ce06b4e-S1'
Received status update TASK_STARTING for task 'sleep'
  source: SOURCE_EXECUTOR
Received status update TASK_RUNNING for task 'sleep'
  source: SOURCE_EXECUTOR
{noformat}


Next, I start a second agent on the same host as the first one:
{noformat}
$ sudo ./src/mesos-agent --work_dir=/tmp/ --master=example.org:5050 
--isolation="linux/seccomp" --seccomp_config_dir=`pwd`/3rdparty/libseccomp-2.3.3
{noformat}

During startup, this agent detects the container belonging to the other, 
unrelated agent and will attempt to clean it up:
{noformat}
I0904 18:30:44.906430 18067 task_status_update_manager.cpp:207] Recovering task 
status update manager
I0904 18:30:44.906913 18071 containerizer.cpp:797] Recovering Mesos containers
I0904 18:30:44.910077 18070 linux_launcher.cpp:286] Recovering Linux launcher
I0904 18:30:44.910347 18070 linux_launcher.cpp:343] Recovered container 
7f455ed7-6593-41e8-9b29-52ee84d7675b
I0904 18:30:44.910409 18070 linux_launcher.cpp:437] 
7f455ed7-6593-41e8-9b29-52ee84d7675b is a known orphaned container
I0904 18:30:44.910877 18065 containerizer.cpp:1123] Recovering isolators
I0904 18:30:44.911888 18064 containerizer.cpp:1162] Recovering provisioner
I0904 18:30:44.913368 18068 provisioner.cpp:498] Provisioner recovery complete
I0904 18:30:44.913630 18065 containerizer.cpp:1234] Cleaning up orphan 
container 7f455ed7-6593-41e8-9b29-52ee84d7675b
I0904 18:30:44.913656 18065 containerizer.cpp:2576] Destroying container 
7f455ed7-6593-41e8-9b29-52ee84d7675b in RUNNING state
I0904 18:30:44.913666 18065 containerizer.cpp:3278] Transitioning the state of 
container 7f455ed7-6593-41e8-9b29-52ee84d7675b from RUNNING to DESTROYING
I0904 18:30:44.914687 18064 linux_launcher.cpp:576] Asked to destroy container 
7f455ed7-6593-41e8-9b29-52ee84d7675b
I0904 18:30:44.914788 18064 linux_launcher.cpp:618] Destroying cgroup 
'/sys/fs/cgroup/freezer/mesos/7f455ed7-6593-41e8-9b29-52ee84d7675b'
{noformat}


killing the sleep task in the process:
{noformat}
Received status update TASK_FAILED for task 'sleep'
  message: 'Executor terminated'
  source: SOURCE_AGENT
  reason: REASON_EXECUTOR_TERMINATED
{noformat}

After some additional testing, it seems like the value of the `--isolation` 
flag is actually irrelevant: The same behaviour can be observed as long as 
cgroup support is enabled with `--systemd_enable_support`.
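
The logs above suggest a plausible mechanism, modeled in the Python sketch below: each agent treats every container found under the shared cgroup root as an orphan unless it appears in that agent's own checkpointed state. This is a simplified model and not the actual recovery code; `find_orphans` is a hypothetical name.

```python
from typing import Iterable, List


def find_orphans(cgroup_containers: Iterable[str], checkpointed: Iterable[str]) -> List[str]:
    """Containers present in the cgroup hierarchy but unknown to this agent."""
    return sorted(set(cgroup_containers) - set(checkpointed))


# Both agents see the same cgroup hierarchy on the host.
first_agent_container = "7f455ed7-6593-41e8-9b29-52ee84d7675b"
shared_cgroup = [first_agent_container]

# The first agent knows its own container, so nothing is orphaned.
print(find_orphans(shared_cgroup, checkpointed=[first_agent_container]))  # []

# The second agent has a different work_dir and no record of the container.
print(find_orphans(shared_cgroup, checkpointed=[]))
# ['7f455ed7-6593-41e8-9b29-52ee84d7675b']
```

Under this model, giving each agent its own cgroup root would keep the two sets disjoint.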





[jira] [Commented] (MESOS-9956) CSI plugins reporting duplicated volumes will crash the agent.

2019-08-30 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919445#comment-16919445
 ] 

Benno Evers commented on MESOS-9956:


{noformat}
commit 43b86da531a889b1c4b1d7ca6acb2eb924ea01e1 (origin/master, master)
Author: Chun-Hung Hsiao 
Date:   Fri Aug 30 13:04:22 2019 +0200

Gracefully handled duplicated volumes from non-conforming CSI plugins.

If the SLRP uses a plugin that does not conform to the CSI spec and
reports duplicated volumes, the duplicate would be removed.

Review: https://reviews.apache.org/r/71414/


commit b18ce53fe8e49e6f030efe89e0976a9f72ad8b50 (1.9.x, origin/1.9.x)
Author: Chun-Hung Hsiao 
Date:   Fri Aug 30 13:05:37 2019 +0200

Gracefully handled duplicated volumes from non-conforming CSI plugins.

If the SLRP uses a plugin that does not conform to the CSI spec and
reports duplicated volumes, the duplicate would be removed.

Review: https://reviews.apache.org/r/71414/
{noformat}
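
The commit message above says only that the duplicate "would be removed". A minimal sketch of that idea, assuming the first volume reported for an ID is the one kept: the `Volume` struct and `removeDuplicates()` are hypothetical names for illustration, not the actual SLRP code.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a volume reported by a CSI plugin.
struct Volume {
  std::string id;
  long capacityBytes;
};

// Returns the volumes with repeated IDs dropped. The first occurrence of
// each ID is kept (an assumption) and the original order is preserved.
std::vector<Volume> removeDuplicates(const std::vector<Volume>& volumes) {
  std::unordered_set<std::string> seen;
  std::vector<Volume> result;
  for (const Volume& volume : volumes) {
    // insert() reports through `.second` whether the ID was new.
    if (seen.insert(volume.id).second) {
      result.push_back(volume);
    }
  }
  return result;
}
```

Filtering before the result reaches the checkpoint keeps the "one entry per volume ID" invariant that the failed `CHECK` in the ticket description relies on.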

> CSI plugins reporting duplicated volumes will crash the agent.
> --
>
> Key: MESOS-9956
> URL: https://issues.apache.org/jira/browse/MESOS-9956
> Project: Mesos
>  Issue Type: Bug
>  Components: storage
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Blocker
>  Labels: mesosphere, storage
>
> The CSI spec requires volumes to be uniquely identifiable by ID, and thus 
> SLRP currently assumes that a {{ListVolumes}} call does not return duplicated 
> volumes. However, if a SLRP uses a non-conforming CSI plugin that reports 
> duplicated volumes, these volumes would corrupt the SLRP checkpoint and cause 
> the agent to crash at the next reconciliation:
> {noformat}
>  F0829 07:13:55.171332 12721 provider.cpp:1089] Check failed: 
> !checkpointedMap.contains(resource.disk().source().id()){noformat}
> MESOS-9254 introduces periodic reconciliation, which makes this problem much 
> more likely to manifest.





[jira] [Assigned] (MESOS-9947) Java bindings sporadically fail to build

2019-08-27 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9947:
--

  Assignee: Benno Evers
Resolution: Fixed

> Java bindings sporadically fail to build
> 
>
> Key: MESOS-9947
> URL: https://issues.apache.org/jira/browse/MESOS-9947
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>  Labels: maven
>
> We sporadically (maybe once a month?) observe build failures in the java 
> bindings in our internal CI. They look like this:
> {noformat}
> 14:32:18 [ERROR] 
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mesos-ec2-ubuntu-16.04/mesos/build/src/java/generated/org/apache/mesos/Protos.java:[14594,45]
>  error: cannot access StringBuilder
> 14:32:18 [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-compiler-plugin:2.5.1:compile 
> (default-compile) on project mesos: Compilation failure
> 14:32:18 [ERROR] 
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mesos-ec2-ubuntu-16.04/mesos/build/src/java/generated/org/apache/mesos/Protos.java:[14594,45]
>  error: cannot access StringBuilder
> {noformat}





[jira] [Commented] (MESOS-9947) Java bindings sporadically fail to build

2019-08-27 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916690#comment-16916690
 ] 

Benno Evers commented on MESOS-9947:


As it turns out, this was most likely an issue with our CI setup: The Ubuntu 
image we used for running the tests had an automatic update service enabled, 
which sometimes would update the JDK right when the Java bindings were compiled.

I've reverted the change above on the master branch.

> Java bindings sporadically fail to build
> 
>
> Key: MESOS-9947
> URL: https://issues.apache.org/jira/browse/MESOS-9947
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: maven
>
> We sporadically (maybe once a month?) observe build failures in the java 
> bindings in our internal CI. They look like this:
> {noformat}
> 14:32:18 [ERROR] 
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mesos-ec2-ubuntu-16.04/mesos/build/src/java/generated/org/apache/mesos/Protos.java:[14594,45]
>  error: cannot access StringBuilder
> 14:32:18 [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-compiler-plugin:2.5.1:compile 
> (default-compile) on project mesos: Compilation failure
> 14:32:18 [ERROR] 
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mesos-ec2-ubuntu-16.04/mesos/build/src/java/generated/org/apache/mesos/Protos.java:[14594,45]
>  error: cannot access StringBuilder
> {noformat}





[jira] [Commented] (MESOS-9947) Java bindings sporadically fail to build

2019-08-21 Thread Benno Evers (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912277#comment-16912277
 ] 

Benno Evers commented on MESOS-9947:


Diagnostics review: https://reviews.apache.org/r/71337/

> Java bindings sporadically fail to build
> 
>
> Key: MESOS-9947
> URL: https://issues.apache.org/jira/browse/MESOS-9947
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: maven
>
> We sporadically (maybe once a month?) observe build failures in the java 
> bindings in our internal CI. They look like this:
> {noformat}
> 14:32:18 [ERROR] 
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mesos-ec2-ubuntu-16.04/mesos/build/src/java/generated/org/apache/mesos/Protos.java:[14594,45]
>  error: cannot access StringBuilder
> 14:32:18 [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-compiler-plugin:2.5.1:compile 
> (default-compile) on project mesos: Compilation failure
> 14:32:18 [ERROR] 
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mesos-ec2-ubuntu-16.04/mesos/build/src/java/generated/org/apache/mesos/Protos.java:[14594,45]
>  error: cannot access StringBuilder
> {noformat}





[jira] [Created] (MESOS-9947) Java bindings sporadically fail to build

2019-08-21 Thread Benno Evers (Jira)
Benno Evers created MESOS-9947:
--

 Summary: Java bindings sporadically fail to build
 Key: MESOS-9947
 URL: https://issues.apache.org/jira/browse/MESOS-9947
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


We sporadically (maybe once a month?) observe build failures in the java 
bindings in our internal CI. They look like this:

{noformat}
14:32:18 [ERROR] 
/home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mesos-ec2-ubuntu-16.04/mesos/build/src/java/generated/org/apache/mesos/Protos.java:[14594,45]
 error: cannot access StringBuilder
14:32:18 [ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:2.5.1:compile (default-compile) 
on project mesos: Compilation failure
14:32:18 [ERROR] 
/home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mesos-ec2-ubuntu-16.04/mesos/build/src/java/generated/org/apache/mesos/Protos.java:[14594,45]
 error: cannot access StringBuilder
{noformat}





[jira] [Commented] (MESOS-9928) OperationReconciliationTest.FrameworkReconciliationRaceWithUpdateSlaveMessage is severely flaky

2019-08-16 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909073#comment-16909073
 ] 

Benno Evers commented on MESOS-9928:


https://reviews.apache.org/r/71297/

> OperationReconciliationTest.FrameworkReconciliationRaceWithUpdateSlaveMessage 
> is severely flaky
> ---
>
> Key: MESOS-9928
> URL: https://issues.apache.org/jira/browse/MESOS-9928
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Andrei Sekretenko
>Assignee: Benno Evers
>Priority: Major
>  Labels: flaky, flaky-test, foundations
>
> Flakes are frequently observed in the internal CI.
> Example:
> {code}
> [ RUN  ] 
> ContentType/OperationReconciliationTest.FrameworkReconciliationRaceWithUpdateSlaveMessage/1
> I0806 20:00:24.128456 29945 cluster.cpp:177] Creating default 'local' 
> authorizer
> I0806 20:00:24.132164 21364 master.cpp:440] Master 
> 7bbcb55d-ce3b-40e6-a605-62ed7d843832 (ip-172-16-10-6.ec2.internal) started on 
> 172.16.10.6:36902
> I0806 20:00:24.132181 21364 master.cpp:443] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="hierarchical" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/MpmzC4/credentials" --filter_gpu_resources="true" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_operator_event_stream_subscribers="1000" 
> --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
> --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
> --publish_per_framework_metrics="true" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/MpmzC4/master" --zk_session_timeout="10secs"
> I0806 20:00:24.132485 21364 master.cpp:492] Master only allowing 
> authenticated frameworks to register
> I0806 20:00:24.132494 21364 master.cpp:498] Master only allowing 
> authenticated agents to register
> I0806 20:00:24.132500 21364 master.cpp:504] Master only allowing 
> authenticated HTTP frameworks to register
> I0806 20:00:24.132506 21364 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/MpmzC4/credentials'
> I0806 20:00:24.132709 21364 master.cpp:548] Using default 'crammd5' 
> authenticator
> I0806 20:00:24.132845 21364 http.cpp:975] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0806 20:00:24.132975 21364 http.cpp:975] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0806 20:00:24.133085 21364 http.cpp:975] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0806 20:00:24.133188 21364 master.cpp:629] Authorization enabled
> I0806 20:00:24.135308 21363 whitelist_watcher.cpp:77] No whitelist given
> I0806 20:00:24.139948 21364 master.cpp:2168] Elected as the leading master!
> I0806 20:00:24.139968 21364 master.cpp:1664] Recovering from registrar
> I0806 20:00:24.140195 21364 registrar.cpp:339] Recovering registrar
> I0806 20:00:24.141042 21364 registrar.cpp:383] Successfully fetched the 
> registry (0B) in 0ns
> I0806 20:00:24.141141 21364 registrar.cpp:487] Applied 1 operations in 
> 25620ns; attempting to update the registry
> I0806 20:00:24.141793 21364 registrar.cpp:544] Successfully updated the 
> registry in 0ns
> I0806 20:00:24.141894 21364 registrar.cpp:416] Successfully recovered 
> registrar
> I0806 20:00:24.142277 21364 master.cpp:1817] Recovered 0 agents from the 
> registry (175B); allowing 10mins for agents to reregister
> I0806 20:00:24.142611 21366 hierarchical.cpp:241] Initialized hierarchical 
> allocator process
> I0806 20:00:24.142735 21366 hierarchical.cpp:280] Skipping recovery of 
> hierarchical allocator: nothing to recover
> W0806 20:00:24.147953 29945 process.cpp:2877] Attempted to spawn already 
> running process 

[jira] [Assigned] (MESOS-9928) OperationReconciliationTest.FrameworkReconciliationRaceWithUpdateSlaveMessage is severely flaky

2019-08-16 Thread Benno Evers (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9928:
--

Assignee: Benno Evers

> OperationReconciliationTest.FrameworkReconciliationRaceWithUpdateSlaveMessage 
> is severely flaky
> ---
>
> Key: MESOS-9928
> URL: https://issues.apache.org/jira/browse/MESOS-9928
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Andrei Sekretenko
>Assignee: Benno Evers
>Priority: Major
>  Labels: flaky, flaky-test, foundations
>
> Flakes are frequently observed in the internal CI.
> Example:
> {code}
> [ RUN  ] 
> ContentType/OperationReconciliationTest.FrameworkReconciliationRaceWithUpdateSlaveMessage/1
> I0806 20:00:24.128456 29945 cluster.cpp:177] Creating default 'local' 
> authorizer
> I0806 20:00:24.132164 21364 master.cpp:440] Master 
> 7bbcb55d-ce3b-40e6-a605-62ed7d843832 (ip-172-16-10-6.ec2.internal) started on 
> 172.16.10.6:36902
> I0806 20:00:24.132181 21364 master.cpp:443] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="hierarchical" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/MpmzC4/credentials" --filter_gpu_resources="true" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_operator_event_stream_subscribers="1000" 
> --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
> --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
> --publish_per_framework_metrics="true" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/MpmzC4/master" --zk_session_timeout="10secs"
> I0806 20:00:24.132485 21364 master.cpp:492] Master only allowing 
> authenticated frameworks to register
> I0806 20:00:24.132494 21364 master.cpp:498] Master only allowing 
> authenticated agents to register
> I0806 20:00:24.132500 21364 master.cpp:504] Master only allowing 
> authenticated HTTP frameworks to register
> I0806 20:00:24.132506 21364 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/MpmzC4/credentials'
> I0806 20:00:24.132709 21364 master.cpp:548] Using default 'crammd5' 
> authenticator
> I0806 20:00:24.132845 21364 http.cpp:975] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0806 20:00:24.132975 21364 http.cpp:975] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0806 20:00:24.133085 21364 http.cpp:975] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0806 20:00:24.133188 21364 master.cpp:629] Authorization enabled
> I0806 20:00:24.135308 21363 whitelist_watcher.cpp:77] No whitelist given
> I0806 20:00:24.139948 21364 master.cpp:2168] Elected as the leading master!
> I0806 20:00:24.139968 21364 master.cpp:1664] Recovering from registrar
> I0806 20:00:24.140195 21364 registrar.cpp:339] Recovering registrar
> I0806 20:00:24.141042 21364 registrar.cpp:383] Successfully fetched the 
> registry (0B) in 0ns
> I0806 20:00:24.141141 21364 registrar.cpp:487] Applied 1 operations in 
> 25620ns; attempting to update the registry
> I0806 20:00:24.141793 21364 registrar.cpp:544] Successfully updated the 
> registry in 0ns
> I0806 20:00:24.141894 21364 registrar.cpp:416] Successfully recovered 
> registrar
> I0806 20:00:24.142277 21364 master.cpp:1817] Recovered 0 agents from the 
> registry (175B); allowing 10mins for agents to reregister
> I0806 20:00:24.142611 21366 hierarchical.cpp:241] Initialized hierarchical 
> allocator process
> I0806 20:00:24.142735 21366 hierarchical.cpp:280] Skipping recovery of 
> hierarchical allocator: nothing to recover
> W0806 20:00:24.147953 29945 process.cpp:2877] Attempted to spawn already 
> running process files@172.16.10.6:36902
> I0806 20:00:24.149081 

[jira] [Comment Edited] (MESOS-9339) SSL (TLS) peer reverse DNS lookup can block the event loop thread.

2019-08-12 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905433#comment-16905433
 ] 

Benno Evers edited comment on MESOS-9339 at 8/12/19 5:54 PM:
-

This is partially resolved in Mesos 1.9 by https://reviews.apache.org/r/70749/ 
, which eliminates rDNS lookups for incoming TLS connections when setting 
`LIBPROCESS_SSL_HOSTNAME_VALIDATION_SCHEME=openssl`.

We can probably close this once we change the default for that setting from 
`legacy` to `openssl`.


was (Author: bennoe):
This is partially resolved in Mesos 1.9 by https://reviews.apache.org/r/70749/ 
, which eliminates rDNS lookups for incoming TLS connections when setting 
`LIBPROCESS_SSL_HOSTNAME_VALIDATION_SCHEME=openssl`.

We can probably close this once we change the default for that ticket from 
`legacy` to `openssl`.

> SSL (TLS) peer reverse DNS lookup can block the event loop thread.
> --
>
> Key: MESOS-9339
> URL: https://issues.apache.org/jira/browse/MESOS-9339
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Benjamin Mahler
>Priority: Major
>  Labels: foundations
>
> We currently look up the peer hostname in order to perform certificate 
> verification while accepting SSL (TLS) connections. This blocks the event 
> loop thread in cases where it has to go over the network. We saw one issue 
> where a misconfiguration meant that this would block for 15 seconds.
> Once we add asynchronous DNS lookup facilities (MESOS-9338), we can use them 
> to avoid blocking the event loop thread.
> We should consider logging slow DNS reverse lookups and adding timing metrics 
> for them.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (MESOS-9339) SSL (TLS) peer reverse DNS lookup can block the event loop thread.

2019-08-12 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905433#comment-16905433
 ] 

Benno Evers commented on MESOS-9339:


This is partially resolved in Mesos 1.9 by https://reviews.apache.org/r/70749/ 
, which eliminates rDNS lookups for incoming TLS connections when setting 
`LIBPROCESS_SSL_HOSTNAME_VALIDATION_SCHEME=openssl`.

We can probably close this once we change the default for that setting from 
`legacy` to `openssl`.

> SSL (TLS) peer reverse DNS lookup can block the event loop thread.
> --
>
> Key: MESOS-9339
> URL: https://issues.apache.org/jira/browse/MESOS-9339
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Benjamin Mahler
>Priority: Major
>  Labels: foundations
>
> We currently look up the peer hostname in order to perform certificate 
> verification while accepting SSL (TLS) connections. This blocks the event 
> loop thread in cases where it has to go over the network. We saw one issue 
> where a misconfiguration meant that this would block for 15 seconds.
> Once we add asynchronous DNS lookup facilities (MESOS-9338), we can use them 
> to avoid blocking the event loop thread.
> We should consider logging slow DNS reverse lookups and adding timing metrics 
> for them.





[jira] [Assigned] (MESOS-9811) Don't use reverse DNS for hostname validation

2019-07-05 Thread Benno Evers (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9811:
--

   Resolution: Fixed
 Assignee: Benno Evers
Fix Version/s: 1.9.0

{noformat}
commit 0a081e01a3f4af8141a8085ed2f97ee85ea48fe1
Author: Benno Evers 
Date:   Wed Jun 19 15:49:11 2019 +0200

Introduced RFC6125-compliant hostname validation scheme.

This commit introduces a new libprocess SSL flag
`hostname_validation_scheme`, which can be set to 'legacy'
to select the previous hostname validation behaviour or to
'openssl' to use standardized OpenSSL algorithms to handle
hostname validation as part of the TLS handshake.

As a nice side-effect, the new scheme gets rid of reverse DNS
lookups during TLS connection establishment, which used to be
a common source of hard-to-debug unresponsiveness in Mesos
components.

See `docs/ssl.md` in the follow-up commit for details of and
differences between the schemes.

Review: https://reviews.apache.org/r/70749
{noformat}
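
As a rough illustration of how the two scheme names from the commit message could be selected through the environment: the enum and both functions below are hypothetical, not libprocess code. Only the variable name `LIBPROCESS_SSL_HOSTNAME_VALIDATION_SCHEME` and the values `legacy` and `openssl` come from the tickets. Falling back to `legacy` when the variable is unset is an assumption based on the comment on MESOS-9339, which names `legacy` as the current default.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical representation of the two schemes named in the commit.
enum class HostnameValidationScheme { LEGACY, OPENSSL };

// Parses the textual value. Returns false (leaving `scheme` untouched)
// for anything other than "legacy" or "openssl".
bool parseScheme(const std::string& value, HostnameValidationScheme* scheme) {
  if (value == "legacy") {
    *scheme = HostnameValidationScheme::LEGACY;
    return true;
  }
  if (value == "openssl") {
    *scheme = HostnameValidationScheme::OPENSSL;
    return true;
  }
  return false;
}

// Reads the scheme from the environment. An unset variable selects
// 'legacy' (assumed default); an unknown value is reported as an error.
bool schemeFromEnvironment(HostnameValidationScheme* scheme) {
  const char* value = std::getenv("LIBPROCESS_SSL_HOSTNAME_VALIDATION_SCHEME");
  if (value == nullptr) {
    *scheme = HostnameValidationScheme::LEGACY;
    return true;
  }
  return parseScheme(value, scheme);
}
```

Rejecting unknown values instead of silently falling back is a design choice of this sketch: a typo in a security-relevant setting should be visible to the operator.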

> Don't use reverse DNS for hostname validation
> -
>
> Key: MESOS-9811
> URL: https://issues.apache.org/jira/browse/MESOS-9811
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>  Labels: foundations, libprocess, ssl
> Fix For: 1.9.0
>
>
> Upon connection we first resolve the hostname and forget about it
> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/http.cpp#L1462-L1504
> then later use reverse DNS on the remote address to get back a hostname
> https://github.com/apache/mesos/blob/4708c2a368e12a89669135f4d0dd05d9b0b2/3rdparty/libprocess/src/posix/libevent/libevent_ssl_socket.cpp#L548-L556
> and verify the server certificate against *that*.
> Instead, we should verify the server certificate against the hostname that 
> was used by the client to initiate the connection.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9867) Libevent fd cleanup failure may cause hangs in combination with client certificate validation

2019-07-05 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879280#comment-16879280
 ] 

Benno Evers commented on MESOS-9867:


Since this requires very specific conditions in order to happen, I just added a 
warning if these conditions are met.

Over time, this will solve itself as people are using newer and newer libevent 
versions.

{noformat}
commit 1a6760c60dc823b088ffbcf48909cf3e371570f3 (HEAD -> master, origin/master, 
mesosphere-private/ci/bevers/tls-hostname-validation)
Author: Benno Evers 
Date:   Wed Jun 26 16:30:12 2019 +0200

Added warnings about known problems with libevent epoll backend.

Some SSL options are known to cause issues in combination with
older versions of libevent. Detect and warn about this situation.

See MESOS-9867 for details.

Review: https://reviews.apache.org/r/70993
{noformat}

> Libevent fd cleanup failure may cause hangs in combination with client 
> certificate validation
> -
>
> Key: MESOS-9867
> URL: https://issues.apache.org/jira/browse/MESOS-9867
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: libevent, libprocess, ssl, tls
>
> A listening LibeventSSLSocket will check cryptographic certificate validity 
> during the OpenSSL handshake and afterwards call the `openssl::verify()` 
> function to perform hostname validation and other checks on the client 
> certificate. If these checks fail, the bufferevent is deleted and the 
> connection closed:
> {noformat}
> // libevent_ssl_socket.cpp, accept_SSL_callback()
>   if (verify.isError()) {
>     VLOG(1) << "Failed accept, verification error: "
>             << verify.error();
>     request->promise.fail(verify.error());
>     SSL_free(ssl);
>     bufferevent_free(bev);
>     // TODO(jmlvanre): Clean up for readability. Consider RAII
>     // or constructing the impl earlier.
>     CHECK(request->socket >= 0);
>     Try<Nothing> close = os::close(request->socket);
>     if (close.isError()) {
>       LOG(FATAL)
>         << "Failed to close socket " << stringify(request->socket)
>         << ": " << close.error();
>     }
>     delete request;
>     return;
>   }
> {noformat}
> However, when we close the socket fd in the above code, libevent had already 
> registered that file descriptor with epoll() to watch for read and write 
> events on that socket. Since the socket is closed, attempts to remove the 
> corresponding fd from the epoll() structs will fail: (See also: 
> https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-broken-22/)
> {noformat}
> [warn] Epoll MOD(4) on fd 9 failed.  Old events were 6; read change was 2 
> (del); write change was 0 (none): Bad file descriptor
> [warn] Epoll MOD(1) on fd 9 failed.  Old events were 6; read change was 0 
> (none); write change was 2 (del): Bad file descriptor
> {noformat}
> That in itself is harmless, since the kernel removes the kernel 
> object that was associated with fd 9 from the data structure associated with 
> that epoll instance once the fd is closed. So while we get an error attempting 
> to remove fd 9, there is actually nothing left to remove. However, when the 
> epoll call fails, libevent does not adjust the number of readers and writers 
> on that file descriptor:
> {noformat}
> // evmap.c, evmap_io_del()
> [...]
> if (evsel->del(base, ev->ev_fd, old, res, extra) == -1)
>return (-1);
> [...]
> ctx->nread = nread;
> ctx->nwrite = nwrite;
> {noformat}
> In the above, ctx is part of an array collecting information for each file 
> descriptor. That still wouldn't be so bad; however, libevent only adds a 
> file descriptor to the epoll structures the *first* time we attempt to create 
> a read or write event on that file descriptor:
> {noformat}
> // evmap.c, evmap_io_add()
> if (ev->ev_events & EV_READ) {
>     if (++nread == 1)
>         res |= EV_READ;
> }
> if (ev->ev_events & EV_WRITE) {
>     if (++nwrite == 1)
>         res |= EV_WRITE;
> }
> [...]
> if (res) {
>     [...]
>     if (evsel->add(base, ev->ev_fd,
>         old, (ev->ev_events & EV_ET) | res, extra) == -1)
>         return (-1);
>     [...]
> }
> {noformat}
> So when libevent later tries to use the same file descriptor number again 
> for epoll() polling, the process will hang because reads or writes on that 
> file descriptor are never noticed.
> This can be 
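
The bookkeeping problem described in the ticket can be reproduced in a small toy model. This is a sketch under stated assumptions, not libevent code: `ToyEventBase` and its members are hypothetical, the epoll instance is simulated with a `std::set`, and only read events are modelled. It mirrors the two properties quoted from `evmap.c`: a failed delete returns before the reader count is written back, and only the first reader of an fd registers it with epoll.

```cpp
#include <map>
#include <set>

// Toy model of the per-fd bookkeeping; all names are hypothetical.
struct ToyEventBase {
  std::set<int> watched;     // fds registered with the simulated epoll set
  std::map<int, int> nread;  // per-fd reader count, like ctx->nread

  // Simulates close(): the kernel drops the fd from the epoll set itself.
  void closeFd(int fd) { watched.erase(fd); }

  // Like evmap_io_add(): only the first reader registers with epoll.
  void addRead(int fd) {
    int n = nread[fd];
    if (++n == 1) {
      watched.insert(fd);
    }
    nread[fd] = n;
  }

  // Like evmap_io_del(): if removing the fd from epoll fails, return
  // early WITHOUT storing the decremented count.
  bool delRead(int fd) {
    int n = nread[fd];
    if (--n == 0) {
      if (watched.erase(fd) == 0) {
        return false;  // simulated "Bad file descriptor"
      }
    }
    nread[fd] = n;
    return true;
  }
};
```

In this model, the sequence `addRead(9)`, `closeFd(9)`, `delRead(9)` leaves `nread[9]` at 1. When fd 9 is later reused and `addRead(9)` is called again, the count goes to 2, so the fd is never added back to `watched`. That corresponds to the hang described in the ticket.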

[jira] [Comment Edited] (MESOS-9867) Libevent fd cleanup failure may cause hangs in combination with client certificate validation

2019-07-05 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879280#comment-16879280
 ] 

Benno Evers edited comment on MESOS-9867 at 7/5/19 1:56 PM:


Since this requires very specific conditions in order to happen, I just added a 
warning if these conditions are met.

Over time, this will solve itself as people are using newer and newer libevent 
versions.

{noformat}
commit 1a6760c60dc823b088ffbcf48909cf3e371570f3
Author: Benno Evers 
Date:   Wed Jun 26 16:30:12 2019 +0200

Added warnings about known problems with libevent epoll backend.

Some SSL options are known to cause issues in combination with
older versions of libevent. Detect and warn about this situation.

See MESOS-9867 for details.

Review: https://reviews.apache.org/r/70993
{noformat}


was (Author: bennoe):
Since this requires very specific conditions in order to happen, I just added a 
warning if these conditions are met.

Over time, this will solve itself as people are using newer and newer libevent 
versions.

{noformat}
commit 1a6760c60dc823b088ffbcf48909cf3e371570f3 (HEAD -> master, origin/master, 
mesosphere-private/ci/bevers/tls-hostname-validation)
Author: Benno Evers 
Date:   Wed Jun 26 16:30:12 2019 +0200

Added warnings about known problems with libevent epoll backend.

Some SSL options are known to cause issues in combination with
older versions of libevent. Detect and warn about this situation.

See MESOS-9867 for details.

Review: https://reviews.apache.org/r/70993
{noformat}

> Libevent fd cleanup failure may cause hangs in combination with client 
> certificate validation
> -
>
> Key: MESOS-9867
> URL: https://issues.apache.org/jira/browse/MESOS-9867
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: libevent, libprocess, ssl, tls
>
> A listening LibeventSSLSocket will check cryptographic certificate validity 
> during the OpenSSL handshake and afterwards call the `openssl::verify()` 
> function to perform hostname validation and other checks on the client 
> certificate. If these checks fail, the bufferevent is deleted and the 
> connection closed:
> {noformat}
> // libevent_ssl_socket.cpp, accept_SSL_callback()
>   if (verify.isError()) {
>     VLOG(1) << "Failed accept, verification error: "
>             << verify.error();
>     request->promise.fail(verify.error());
>     SSL_free(ssl);
>     bufferevent_free(bev);
>     // TODO(jmlvanre): Clean up for readability. Consider RAII
>     // or constructing the impl earlier.
>     CHECK(request->socket >= 0);
>     Try<Nothing> close = os::close(request->socket);
>     if (close.isError()) {
>       LOG(FATAL)
>         << "Failed to close socket " << stringify(request->socket)
>         << ": " << close.error();
>     }
>     delete request;
>     return;
>   }
> {noformat}
> However, when we close the socket fd in the above code, libevent had already 
> registered that file descriptor with epoll() to watch for read and write 
> events on that socket. Since the socket is closed, attempts to remove the 
> corresponding fd from the epoll() structs will fail: (See also: 
> https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-broken-22/)
> {noformat}
> [warn] Epoll MOD(4) on fd 9 failed.  Old events were 6; read change was 2 
> (del); write change was 0 (none): Bad file descriptor
> [warn] Epoll MOD(1) on fd 9 failed.  Old events were 6; read change was 0 
> (none); write change was 2 (del): Bad file descriptor
> {noformat}
> That in itself is harmless, since the kernel removes the kernel 
> object that was associated with fd 9 from the data structure associated with 
> that epoll instance once the fd is closed. So while we get an error attempting 
> to remove fd 9, there is actually nothing left to remove. However, when the 
> epoll call fails, libevent does not adjust the number of readers and writers 
> on that file descriptor:
> {noformat}
> // evmap.c, evmap_io_del()
> [...]
> if (evsel->del(base, ev->ev_fd, old, res, extra) == -1)
>return (-1);
> [...]
> ctx->nread = nread;
> ctx->nwrite = nwrite;
> {noformat}
> In the above, ctx is part of an array collecting information for each file 
> descriptor. That still wouldn't be so bad; however, libevent only adds a 
> file descriptor to the epoll structures the *first* time we attempt to create 
> a read or write event on that file descriptor:
> {noformat}
> // evmap.c, evmap_io_add()
> if (ev->ev_events & EV_READ) {

[jira] [Assigned] (MESOS-9878) Enable libprocess users to pass a custom SSL context when using Socket

2019-07-05 Thread Benno Evers (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9878:
--

   Resolution: Fixed
 Assignee: Benno Evers
Fix Version/s: 1.9.0

{noformat}
commit ec129665a346f86c738522536f89de7c519f3e0d
Author: Benno Evers 
Date:   Fri Jun 28 20:12:44 2019 +0200

Added ability to pass custom SSL context to `Socket::connect()`.

Users of libprocess can now pass a custom SSL context when
connecting a generic socket via the `Socket::connect()`
function.

Additionally the API of `Socket::connect()` was also reworked
according to the following boundary conditions requested by
libprocess maintainers:

 * When libprocess is compiled without SSL support, neither the
   declaration of the TLS configuration object nor the `connect()`
   overload that accepts the TLS configuration should be available.
 * Passing just the servername is not an acceptable short-hand for
   using the default TLS configuration together with that servername.
 * When the incorrect overload is selected (i.e. passing TLS config
   to a poll socket or omitting TLS configuration for a TLS socket),
   the program should abort.

The following changes are introduced according to the requirements
above:

 * A new class `openssl::TLSClientConfig` is introduced when libprocess
   is compiled with ssl support.
 * A new overload
   `Socket::connect(const Address&, const TLSClientConfig&)` is
   introduced when libprocess is compiled with ssl support.
 * All call sites are adjusted to check the socket kind before calling
   `connect()`.

Review: https://reviews.apache.org/r/70991
{noformat}

> Enable libprocess users to pass a custom SSL context when using Socket
> --
>
> Key: MESOS-9878
> URL: https://issues.apache.org/jira/browse/MESOS-9878
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Minor
>  Labels: libprocess
> Fix For: 1.9.0
>
>
> Connections made through the `Socket::connect()` API will always use the 
> libprocess-global SSL configuration made through the `LIBPROCESS_SSL_*` 
> environment variables.
> Libprocess users might want to override these options while still using the 
> generic socket class.
> Therefore we should provide a way to pass custom configuration to the 
> `Socket::connect()` function.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9878) Enable libprocess users to pass a custom SSL context when using Socket

2019-07-05 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879278#comment-16879278
 ] 

Benno Evers commented on MESOS-9878:


https://reviews.apache.org/r/70991/

> Enable libprocess users to pass a custom SSL context when using Socket
> --
>
> Key: MESOS-9878
> URL: https://issues.apache.org/jira/browse/MESOS-9878
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Priority: Minor
>  Labels: libprocess
>
> Connections made through the `Socket::connect()` API will always use the 
> libprocess-global SSL configuration made through the `LIBPROCESS_SSL_*` 
> environment variables.
> Libprocess users might want to override these options while still using the 
> generic socket class.
> Therefore we should provide a way to pass custom configuration to the 
> `Socket::connect()` function.





[jira] [Created] (MESOS-9879) Create a unit test ensuring that a client certificate requests are properly ignored

2019-07-02 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9879:
--

 Summary: Create a unit test ensuring that a client certificate 
requests are properly ignored
 Key: MESOS-9879
 URL: https://issues.apache.org/jira/browse/MESOS-9879
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


When a TLS server sends a Client Certificate Request as part of the handshake 
and the client does not have a certificate available, the TLS specification 
mandates that the client shall attempt to continue the connection attempt 
sending a zero-length certificate.

We should write a unit test verifying libprocess handles this correctly when 
acting as a client, although it's not completely clear how this might be 
implemented.





[jira] [Created] (MESOS-9878) Enable libprocess users to pass a custom SSL context when using Socket

2019-07-02 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9878:
--

 Summary: Enable libprocess users to pass a custom SSL context when 
using Socket
 Key: MESOS-9878
 URL: https://issues.apache.org/jira/browse/MESOS-9878
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


Connections made through the `Socket::connect()` API will always use the 
libprocess-global SSL configuration made through the `LIBPROCESS_SSL_*` 
environment variables.

Libprocess users might want to override these options while still using the 
generic socket class.

Therefore we should provide a way to pass custom configuration to the 
`Socket::connect()` function.





[jira] [Created] (MESOS-9877) Possible segfault due to spurious EPOLLHUP.

2019-07-02 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9877:
--

 Summary: Possible segfault due to spurious EPOLLHUP.
 Key: MESOS-9877
 URL: https://issues.apache.org/jira/browse/MESOS-9877
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


On Linux, registering a TCP socket with epoll before calling connect() causes 
`epoll_wait()` to report an EPOLLHUP event on that socket. This can be 
verified with the following code snippet:

{noformat}
#include <sys/epoll.h>
#include <sys/socket.h>

#include <netinet/in.h>

int main() {
  int epfd = epoll_create1(0);
  int s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
  struct epoll_event event;
  event.events = EPOLLIN;
  event.data.u64 = s; // user data
  epoll_ctl(epfd, EPOLL_CTL_ADD, s, &event);

  struct epoll_event events[128];
  // No connect() was called on `s`, yet this returns an event for it.
  epoll_wait(epfd, events, 128, 500 /*ms*/);
}

// Run using `strace ./a.out`.
{noformat}

Libevent then turns EPOLLHUP into a read/write event:
{noformat}
// epoll.c
if (what & (EPOLLHUP|EPOLLERR)) {
ev = EV_READ | EV_WRITE;
}
[...]
{noformat}

This means, when another thread was inside `epoll_wait()` while that fd is 
added, the wait will return immediately for that new fd.
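
To compare the two orderings side by side, here is a rough sketch. It is not taken from libprocess or libevent, and `firstEvents()` is a made-up helper. It registers a fresh TCP socket for EPOLLIN either before or after connect() and returns whatever event mask `epoll_wait()` reports. On Linux I would expect the unconnected case to include EPOLLHUP, as described above, and the connected case to simply time out:

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdint>

// Returns the epoll event mask reported for a fresh TCP socket that is
// registered for EPOLLIN, or 0 if epoll_wait() times out or setup fails.
// If `connectFirst` is true, the socket is connected to a local listener
// before it is added to the epoll set.
uint32_t firstEvents(bool connectFirst)
{
  // Local listener on an ephemeral loopback port.
  int listener = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr = {};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = 0;
  socklen_t length = sizeof(addr);
  if (listener < 0 ||
      bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
      listen(listener, 1) != 0 ||
      getsockname(
          listener, reinterpret_cast<sockaddr*>(&addr), &length) != 0) {
    if (listener >= 0) close(listener);
    return 0;
  }

  int client = socket(AF_INET, SOCK_STREAM, 0);
  int epfd = epoll_create1(0);
  uint32_t result = 0;

  bool ready = client >= 0 && epfd >= 0;
  if (ready && connectFirst) {
    // Blocking connect to the listener created above.
    ready = connect(
        client, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0;
  }

  if (ready) {
    epoll_event event = {};
    event.events = EPOLLIN;
    event.data.fd = client;
    epoll_event out = {};
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, client, &event) == 0 &&
        epoll_wait(epfd, &out, 1, 100 /*ms*/) == 1) {
      result = out.events;
    }
  }

  if (epfd >= 0) close(epfd);
  if (client >= 0) close(client);
  close(listener);
  return result;
}
```

Checking `firstEvents(false) & EPOLLHUP` against `firstEvents(true) & EPOLLHUP` shows whether the spurious wakeup depends on the ordering of connect() and EPOLL_CTL_ADD.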

Apparently, either our own code or libevent's does not handle this case 
correctly. For example, here is a syscall sequence of `SSLTest.VerifyBadCA` 
failing:
{noformat}
[pid 12012] 1562077806.912193 socket(AF_INET, 
SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 8
[pid 12012] 1562077806.912244 epoll_ctl(3, EPOLL_CTL_ADD, 8, {EPOLLIN, {u32=8, 
u64=8}}) = 0
[pid 12021] 1562077806.912261 <... epoll_wait resumed> [{EPOLLHUP, {u32=8, 
u64=8}}], 32, 100) = 1
[pid 12012] 1562077806.912269 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 12012] 1562077806.912303 epoll_ctl(3, EPOLL_CTL_MOD, 8, {EPOLLIN|EPOLLOUT, 
{u32=8, u64=8}}) = 0
[pid 12021] 1562077806.912371 write(8, 
"\26\3\1\0k\1\0\0g\3\3\r~\336VZ\227I\216\260\304\356\10\200\327\271\320\td\304'O"...,
 112) = -1 EPIPE (Broken pipe)
[pid 12021] 1562077806.912395 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, 
si_pid=12012, si_uid=1000} ---
[pid 12021] 1562077806.912415 epoll_ctl(3, EPOLL_CTL_MOD, 8, {EPOLLOUT, {u32=8, 
u64=8}}) = 0
[pid 12021] 1562077806.912435 epoll_ctl(3, EPOLL_CTL_DEL, 8, 0x7fc35be23afc) = 0
[pid 12021] 1562077806.912460 connect(8, {sa_family=AF_INET, 
sin_port=htons(45067), sin_addr=inet_addr("127.0.1.1")}, 16) = -1 EINPROGRESS 
(Operation now in progress)
[pid 12011] 1562077806.912533 <... epoll_wait resumed> [{EPOLLIN, {u32=7, 
u64=7}}], 32, 11) = 1
[pid 12021] 1562077806.912543 epoll_ctl(3, EPOLL_CTL_ADD, 8, {EPOLLIN, {u32=8, 
u64=8}}) = 0
[pid 12011] 1562077806.912562 epoll_ctl(3, EPOLL_CTL_DEL, 7, 0x7f1dbcee0a9c 
<unfinished ...>
[pid 12021] 1562077806.912571 epoll_ctl(3, EPOLL_CTL_MOD, 8, {EPOLLIN|EPOLLOUT, 
{u32=8, u64=8}} <unfinished ...>
[pid 12011] 1562077806.912580 <... epoll_ctl resumed> ) = 0
[pid 12021] 1562077806.912586 <... epoll_ctl resumed> ) = 0
[pid 12021] 1562077806.912599 epoll_wait(3, [{EPOLLIN, {u32=6, u64=6}}, 
{EPOLLOUT, {u32=8, u64=8}}], 32, 100) = 2
[pid 12021] 1562077806.912636 write(8, 
"\26\3\1\0k\1\0\0g\3\3\r~\336VZ\227I\216\260\304\356\10\200\327\271\320\td\304'O"...,
 112) = 112
[pid 12021] 1562077806.912684 epoll_ctl(3, EPOLL_CTL_MOD, 8, {EPOLLIN, {u32=8, 
u64=8}}) = 0
[pid 12021] 1562077806.912705 epoll_wait(3,  <unfinished ...>
[pid 12011] 1562077806.912954 write(2, "W0702 16:30:06.912921 12011 proc"..., 
113W0702 16:30:06.912921 12011 process.cpp:844] Failed to recv on socket 9 to 
peer '127.0.0.1:52578': Decoder error
) = 113
[pid 12011] 1562077806.913004 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, 
u64=7}}) = 0
[pid 12021] 1562077806.913088 <... epoll_wait resumed> [{EPOLLIN, {u32=8, 
u64=8}}], 32, 100) = 1
[pid 12021] 1562077806.913119 epoll_ctl(3, EPOLL_CTL_DEL, 8, 0x7fc35be23afc) = 0
[pid 12011] 1562077806.913159 epoll_wait(3,  <unfinished ...>
[pid 12021] 1562077806.913168 write(2, "SETTING bev TO NULL 1\n", 22SETTING bev 
TO NULL 1
) = 22
[pid 12021] 1562077806.913219 epoll_wait(3,  <unfinished ...>
[pid 12003] 1562077806.913233 write(6, "\1\0\0\0\0\0\0\0", 8 <unfinished ...>
[pid 12011] 1562077806.913253 <... epoll_wait resumed> [{EPOLLIN, {u32=6, 
u64=6}}], 32, 14990) = 1
[pid 12003] 1562077806.913293 <... write resumed> ) = 8
[pid 12011] 1562077806.913375 epoll_wait(3,  <unfinished ...>
[pid 12012] 1562077806.913412 write(1, "../../../3rdparty/libprocess/src"..., 
122) = 122
[pid 12012] 1562077806.913449 write(6, "\1\0\0\0\0\0\0\0", 8 <unfinished ...>
[pid 12021] 1562077806.913464 <... epoll_wait resumed> [{EPOLLIN, {u32=6, 
u64=6}}], 32, 99) = 1
[pid 12012] 1562077806.913475 <... write resumed> ) = 8
[pid 12021] 1562077806.913515 --- SIGSEGV {si_signo=SIGSEGV, 
si_code=SEGV_MAPERR, si_addr=0x128} ---
[pid 12020] 1562077807.003305 +++ killed by SIGSEGV (core dumped) +++
{noformat}

As we can see from the above, the first wakeup triggered the `ssl-client` to 
attempt to write the SSL Client Hello to the socket, 

[jira] [Commented] (MESOS-9774) Design client side SSL certificate verification in Libprocess.

2019-07-01 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876306#comment-16876306
 ] 

Benno Evers commented on MESOS-9774:


Design Doc: 
https://docs.google.com/document/d/1O3q7UOXVGNw81xOkRNFPzrtbC__D-N_D_mwV6D--y0k/edit

> Design client side SSL certificate verification in Libprocess.
> --
>
> Key: MESOS-9774
> URL: https://issues.apache.org/jira/browse/MESOS-9774
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess
>Reporter: Greg Mann
>Assignee: Alexander Rukletsov
>Priority: Major
>  Labels: foundations, mesosphere, security, ssl
>
> Notes from an offline discussion with [~vinodkone], [~tillt], [~jgehrcke], 
> [~CarlDellar].
> * Authentication can happen at the transport and/or at the application layer. 
> There is no real benefit in doing it at both layers.
> * Authentication at the application layer allows for subsequent authorization.
> * We would like to have an option to mutually authenticate all components in 
> a Mesos cluster, including external tooling, regardless at which layer, to 
> secure communication channels.
> * Mutual authentication at the transport layer everywhere can be hard because 
> some components can't or don't want to provide certificates, e.g., a Lua HTTP 
> client reading master's state.
> * Theoretically, some components, e.g., Mesos masters and agents, can form an 
> ensemble inside which all connections are authenticated on both sides at the 
> transport layer (TLS certificate verification). Practically, it may then be 
> hard to implement communication with the components outside such ensemble, 
> e.g., frameworks, executors, since at least two types of connections/sockets 
> should be distinguished: with and without client certificate verification 
> (Libprocess can't do it now), or all the traffic between the ensemble and 
> outside components should go via a proxy.
> * An alternative is to combine server side TLS certificate verification with 
> the client side application layer authentication. For that to be secure, we 
> need to implement client authentication for Mesos components, e.g., master 
> with agent, replica with other replica (see MESOS-9638). Plus relax 
> certificate verification option in Libprocess for outgoing connections only. 
> For non-streaming connections a secret connection identifier should be passed 
> by the client to prove they are the entity that has been previously 
> authenticated.
> * Whatever path we choose, truly secure communication channels will only be 
> achieved when separate certificates for Mesos components are used, either 
> signed by a different root CA or using a specific CN/SAN, which can't be 
> obtained by everyone.
> What needs to be done:
> * Introduce or adjust the Libprocess flag for verifying certificates for 
> outgoing connections only.
> * Verify how replicas in the master's replicated log discover other replicas 
> and what harm a rogue replica can do if it tries to join the quorum. Estimate 
> whether master's replicated log can use its own copy of Libprocess.
> * Implement Mesos master authentication with Mesos agents, MESOS-9638.





[jira] [Comment Edited] (MESOS-9867) Libevent fd cleanup failure may cause hangs in subsequent tests

2019-06-27 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16873908#comment-16873908
 ] 

Benno Evers edited comment on MESOS-9867 at 6/27/19 11:25 AM:
--

This was fixed upstream with libevent 2.1.8: 
https://github.com/libevent/libevent/commit/9b5a527f5bf898250a797dde59cadb4f64e8967a


was (Author: bennoe):
This was fixed upstream with libevent-2.1.10: 
https://github.com/libevent/libevent/commit/9b5a527f5bf898250a797dde59cadb4f64e8967a

> Libevent fd cleanup failure may cause hangs in subsequent tests
> ---
>
> Key: MESOS-9867
> URL: https://issues.apache.org/jira/browse/MESOS-9867
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: libevent, libprocess, ssl, tls
>
> A listening LibeventSSLSocket will check cryptographic certificate validity 
> during the OpenSSL handshake and afterwards call the `openssl::verify()` 
> function to perform hostname validation and other checks on the client 
> certificate. If these checks fail, the bufferevent is deleted and the 
> connection closed:
> {noformat}
> // libevent_ssl_socket.cpp, accept_SSL_callback()
>   if (verify.isError()) {
> VLOG(1) << "Failed accept, verification error: " << 
> verify.error();
> request->promise.fail(verify.error());
> SSL_free(ssl);
> bufferevent_free(bev);
> // TODO(jmlvanre): Clean up for readability. Consider RAII
> // or constructing the impl earlier.
> CHECK(request->socket >= 0);
> Try<Nothing> close = os::close(request->socket);
> if (close.isError()) {
>   LOG(FATAL)
> << "Failed to close socket " << stringify(request->socket)
> << ": " << close.error();
> }
> delete request;
> return;
>   }
> {noformat}
> However, when we close the socket fd in the above code, libevent had already 
> registered that file descriptor with epoll() to watch for read and write 
> events on that socket. Since the socket is closed, attempts to remove the 
> corresponding fd from the epoll() structs will fail: (See also: 
> https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-broken-22/)
> {noformat}
> [warn] Epoll MOD(4) on fd 9 failed.  Old events were 6; read change was 2 
> (del); write change was 0 (none): Bad file descriptor
> [warn] Epoll MOD(1) on fd 9 failed.  Old events were 6; read change was 0 
> (none); write change was 2 (del): Bad file descriptor
> {noformat}
> However, that in itself is harmless since the kernel will remove the kernel 
> object that was associated with fd 9 from the data structure associated with 
> that epoll instance in the kernel. So while we get an error attempting to 
> remove fd 9, there is actually nothing left to remove. The real problem is 
> that, when the epoll call fails, libevent does not adjust the number of 
> readers and writers on that file descriptor:
> {noformat}
> // evmap.c, evmap_io_del()
> [...]
> if (evsel->del(base, ev->ev_fd, old, res, extra) == -1)
>return (-1);
> [...]
> ctx->nread = nread;
> ctx->nwrite = nwrite;
> {noformat}
> In the above, ctx is part of an array collecting information for each file 
> descriptor. That still wouldn't be so bad, however libevent also only adds 
> file descriptors to `epoll()` struct the *first* time we attempt to create a 
> read or write event on that file descriptor:
> {noformat}
> // evmap.c, evmap_io_add()
> if (ev->ev_events & EV_READ) {
> if (++nread == 1)
> res |= EV_READ;
> }
> if (ev->ev_events & EV_WRITE) {
> if (++nwrite == 1)
> res |= EV_WRITE;
> }
> [...]
> if (res) {
> [...]
> if (evsel->add(base, ev->ev_fd,
> old, (ev->ev_events & EV_ET) | res, extra) == -1)
> return (-1);
> [...]
> }
> {noformat}
> So when libevent later tries to use the same file descriptor number again 
> for epoll() polling, the process will hang because reads or writes to that 
> file descriptor are never noticed.
> This can be reproduced for example by running a test where the 
> `verify()`-callback fails on the server side twice in a row: (note, the 
> LIBPROCESS_IP below is set in order to induce a test failure, result may vary 
> on your local network and ssl configuration)
> {noformat}
> LIBPROCESS_IP=127.0.1.1 ./libprocess-tests 
> --gtest_filter="*VerifyCertificate*" --gtest_repeat=2
> {noformat}
> There is a chance that the issue described here is the same as the ominous 

[jira] [Commented] (MESOS-9867) Libevent fd cleanup failure may cause hangs in subsequent tests

2019-06-27 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16873908#comment-16873908
 ] 

Benno Evers commented on MESOS-9867:


This was fixed upstream with libevent-2.1.10: 
https://github.com/libevent/libevent/commit/9b5a527f5bf898250a797dde59cadb4f64e8967a

> Libevent fd cleanup failure may cause hangs in subsequent tests
> ---
>
> Key: MESOS-9867
> URL: https://issues.apache.org/jira/browse/MESOS-9867
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: libevent, libprocess, ssl, tls
>
> A listening LibeventSSLSocket will check cryptographic certificate validity 
> during the OpenSSL handshake and afterwards call the `openssl::verify()` 
> function to perform hostname validation and other checks on the client 
> certificate. If these checks fail, the bufferevent is deleted and the 
> connection closed:
> {noformat}
> // libevent_ssl_socket.cpp, accept_SSL_callback()
>   if (verify.isError()) {
> VLOG(1) << "Failed accept, verification error: " << 
> verify.error();
> request->promise.fail(verify.error());
> SSL_free(ssl);
> bufferevent_free(bev);
> // TODO(jmlvanre): Clean up for readability. Consider RAII
> // or constructing the impl earlier.
> CHECK(request->socket >= 0);
> Try<Nothing> close = os::close(request->socket);
> if (close.isError()) {
>   LOG(FATAL)
> << "Failed to close socket " << stringify(request->socket)
> << ": " << close.error();
> }
> delete request;
> return;
>   }
> {noformat}
> However, when we close the socket fd in the above code, libevent had already 
> registered that file descriptor with epoll() to watch for read and write 
> events on that socket. Since the socket is closed, attempts to remove the 
> corresponding fd from the epoll() structs will fail: (See also: 
> https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-broken-22/)
> {noformat}
> [warn] Epoll MOD(4) on fd 9 failed.  Old events were 6; read change was 2 
> (del); write change was 0 (none): Bad file descriptor
> [warn] Epoll MOD(1) on fd 9 failed.  Old events were 6; read change was 0 
> (none); write change was 2 (del): Bad file descriptor
> {noformat}
> However, that in itself is harmless since the kernel will remove the kernel 
> object that was associated with fd 9 from the data structure associated with 
> that epoll instance in the kernel. So while we get an error attempting to 
> remove fd 9, there is actually nothing left to remove. The real problem is 
> that, when the epoll call fails, libevent does not adjust the number of 
> readers and writers on that file descriptor:
> {noformat}
> // evmap.c, evmap_io_del()
> [...]
> if (evsel->del(base, ev->ev_fd, old, res, extra) == -1)
>return (-1);
> [...]
> ctx->nread = nread;
> ctx->nwrite = nwrite;
> {noformat}
> In the above, ctx is part of an array collecting information for each file 
> descriptor. That still wouldn't be so bad, however libevent also only adds 
> file descriptors to `epoll()` struct the *first* time we attempt to create a 
> read or write event on that file descriptor:
> {noformat}
> // evmap.c, evmap_io_add()
> if (ev->ev_events & EV_READ) {
> if (++nread == 1)
> res |= EV_READ;
> }
> if (ev->ev_events & EV_WRITE) {
> if (++nwrite == 1)
> res |= EV_WRITE;
> }
> [...]
> if (res) {
> [...]
> if (evsel->add(base, ev->ev_fd,
> old, (ev->ev_events & EV_ET) | res, extra) == -1)
> return (-1);
> [...]
> }
> {noformat}
> So when libevent later tries to use the same file descriptor number again 
> for epoll() polling, the process will hang because reads or writes to that 
> file descriptor are never noticed.
> This can be reproduced for example by running a test where the 
> `verify()`-callback fails on the server side twice in a row: (note, the 
> LIBPROCESS_IP below is set in order to induce a test failure, result may vary 
> on your local network and ssl configuration)
> {noformat}
> LIBPROCESS_IP=127.0.1.1 ./libprocess-tests 
> --gtest_filter="*VerifyCertificate*" --gtest_repeat=2
> {noformat}
> There is a chance that the issue described here is the same as the ominous 
> "issues" described in MESOS-3008, 





[jira] [Created] (MESOS-9867) Libevent fd cleanup failure may cause hangs in subsequent tests

2019-06-26 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9867:
--

 Summary: Libevent fd cleanup failure may cause hangs in subsequent 
tests
 Key: MESOS-9867
 URL: https://issues.apache.org/jira/browse/MESOS-9867
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


A listening LibeventSSLSocket will check cryptographic certificate validity 
during the OpenSSL handshake and afterwards call the `openssl::verify()` 
function to perform hostname validation and other checks on the client 
certificate. If these checks fail, the bufferevent is deleted and the 
connection closed:
{noformat}
// libevent_ssl_socket.cpp, accept_SSL_callback()
  if (verify.isError()) {
VLOG(1) << "Failed accept, verification error: " << verify.error();
request->promise.fail(verify.error());
SSL_free(ssl);
bufferevent_free(bev);
// TODO(jmlvanre): Clean up for readability. Consider RAII
// or constructing the impl earlier.
CHECK(request->socket >= 0);
Try<Nothing> close = os::close(request->socket);
if (close.isError()) {
  LOG(FATAL)
<< "Failed to close socket " << stringify(request->socket)
<< ": " << close.error();
}
delete request;
return;
  }
{noformat}

However, when we close the socket fd in the above code, libevent had already 
registered that file descriptor with epoll() to watch for read and write events 
on that socket. Since the socket is closed, attempts to remove the 
corresponding fd from the epoll() structs will fail: (See also: 
https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-broken-22/)
{noformat}
[warn] Epoll MOD(4) on fd 9 failed.  Old events were 6; read change was 2 
(del); write change was 0 (none): Bad file descriptor
[warn] Epoll MOD(1) on fd 9 failed.  Old events were 6; read change was 0 
(none); write change was 2 (del): Bad file descriptor
{noformat}

However, that in itself is harmless since the kernel will remove the kernel 
object that was associated with fd 9 from the data structure associated with 
that epoll instance in the kernel. So while we get an error attempting to 
remove fd 9, there is actually nothing left to remove. The real problem is 
that, when the epoll call fails, libevent does not adjust the number of 
readers and writers on that file descriptor:
{noformat}
// evmap.c, evmap_io_del()
[...]
if (evsel->del(base, ev->ev_fd, old, res, extra) == -1)
   return (-1);

[...]
ctx->nread = nread;
ctx->nwrite = nwrite;
{noformat}

In the above, ctx is part of an array collecting information for each file 
descriptor. That still wouldn't be so bad, however libevent also only adds file 
descriptors to `epoll()` struct the *first* time we attempt to create a read or 
write event on that file descriptor:

{noformat}
// evmap.c, evmap_io_add()
if (ev->ev_events & EV_READ) {
if (++nread == 1)
res |= EV_READ;
}
if (ev->ev_events & EV_WRITE) {
if (++nwrite == 1)
res |= EV_WRITE;
}

[...]

if (res) {
[...]
if (evsel->add(base, ev->ev_fd,
old, (ev->ev_events & EV_ET) | res, extra) == -1)
return (-1);
[...]
}
{noformat}

So when libevent later tries to use the same file descriptor number again for 
epoll() polling, the process will hang because reads or writes to that file 
descriptor are never noticed.
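
The interaction of the two code paths above can be reproduced with a toy model. This is only a sketch of the bookkeeping as described in this ticket, not libevent code: `FdState`, `addReader()`, `delReader()` and `reuseIsWatched()` are made-up names, and the `updateOnFailure` switch is a hypothetical variant for comparison, not a claim about how upstream addressed the problem.

```cpp
// Toy model of the per-fd bookkeeping; all names are illustrative.
struct FdState
{
  int nread = 0;            // Number of read events the library believes exist.
  bool inKernelSet = false; // Whether the kernel epoll set contains the fd.
  bool closed = false;      // Whether the fd was closed behind the library.
};

// Mimics evmap_io_add(): the backend is only called on the 0 -> 1 transition.
void addReader(FdState& fd)
{
  if (++fd.nread == 1) {
    fd.inKernelSet = true; // Stands in for evsel->add().
  }
}

// Mimics evmap_io_del(): the backend is called on the 1 -> 0 transition.
// Returns false if the (simulated) backend call failed.
bool delReader(FdState& fd, bool updateOnFailure)
{
  const int nread = fd.nread - 1;
  bool ok = true;
  if (nread == 0) {
    // Stands in for evsel->del(): removing a closed fd fails with EBADF.
    ok = !fd.closed;
    fd.inKernelSet = false;
    if (!ok && !updateOnFailure) {
      return false; // Early return: fd.nread keeps its stale value.
    }
  }
  fd.nread = nread;
  return ok;
}

// Replays the sequence from this ticket: watch an fd, close it, fail to
// remove it, then have the same fd number handed out again for a new socket.
// Returns whether the new socket ends up registered with the kernel.
bool reuseIsWatched(bool updateOnFailure)
{
  FdState fd;
  addReader(fd);

  fd.closed = true;      // os::close() on the raw fd.
  fd.inKernelSet = false; // The kernel drops the entry on close.
  delReader(fd, updateOnFailure);

  fd.closed = false;     // The same fd number is reused for a new socket.
  addReader(fd);
  return fd.inKernelSet;
}
```

With the early return, `nread` stays at 1 after the failed removal, so the next `addReader()` moves it to 2 and never reaches the backend; `reuseIsWatched(false)` reports that the reused fd is not being polled.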

This can be reproduced for example by running a test where the 
`verify()`-callback fails on the server side twice in a row: (note, the 
LIBPROCESS_IP below is set in order to induce a test failure, result may vary 
on your local network and ssl configuration)
{noformat}
LIBPROCESS_IP=127.0.1.1 ./libprocess-tests --gtest_filter="*VerifyCertificate*" 
--gtest_repeat=2
{noformat}

There is a chance that the issue described here is the same as the ominous 
"issues" described in MESOS-3008, 





[jira] [Created] (MESOS-9863) Libprocess SSL tests may fail client certificate validation

2019-06-25 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9863:
--

 Summary: Libprocess SSL tests may fail client certificate 
validation
 Key: MESOS-9863
 URL: https://issues.apache.org/jira/browse/MESOS-9863
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


In the current libprocess `ssl_tests.cpp`, we create a "valid" server 
certificate containing the hostname returned by ::getnameinfo() for the IP of 
`libprocess::address()`. The libprocess IP is by default determined by a DNS 
lookup for the current hostname.

As an example, let's assume my hostname is `poincare` and the libprocess IP is 
`127.0.1.1`.

The tests then spawn the `ssl-client` binary as a subprocess passing the server 
IP as a command-line argument. The `ssl-client` binary will connect to the 
passed IP. Since we do not bind() before calling connect, the source IP for 
that connection will be automatically determined by the kernel.

Continuing the example, the `ssl-client` connects to 127.0.1.1. Since it is a 
loopback address, the kernel will automatically select 127.0.0.1 as the source 
IP.

On the server side, libprocess will now do a reverse DNS lookup on the source 
IP to determine the hostname of the connecting client. If it doesn't match the 
provided client certificate, the connection is rejected.

In the example, libprocess will determine (127.0.0.1, 'localhost') as source 
ip/hostname, but the certificate contains (127.0.1.1, 'poincare'). Therefore, 
the connection attempt is rejected.


Possible solutions to this include binding before calling connect to fix the 
source ip, or only running these tests with the 'openssl' hostname validation 
scheme after the corresponding review chain has landed.
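
For the first option, a minimal sketch of pinning the source address with plain POSIX sockets could look like the following. The helper names `socketBoundTo()` and `localIp()` are made up for illustration; the actual `ssl-client` code goes through the libprocess socket abstraction and would need an equivalent hook there.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <string>

// Creates a TCP socket whose source address is pinned to `sourceIp`.
// Port 0 lets the kernel pick an ephemeral port. Returns the fd, or -1
// on error. The caller would call connect() on the returned fd.
int socketBoundTo(const std::string& sourceIp)
{
  sockaddr_in source = {};
  source.sin_family = AF_INET;
  source.sin_port = 0;
  if (inet_pton(AF_INET, sourceIp.c_str(), &source.sin_addr) != 1) {
    return -1;
  }

  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) {
    return -1;
  }

  if (bind(fd, reinterpret_cast<sockaddr*>(&source), sizeof(source)) != 0) {
    close(fd);
    return -1;
  }

  return fd;
}

// Returns the local IPv4 address of `fd` as a string, or "" on error.
std::string localIp(int fd)
{
  sockaddr_in local = {};
  socklen_t length = sizeof(local);
  if (getsockname(fd, reinterpret_cast<sockaddr*>(&local), &length) != 0) {
    return "";
  }

  char buffer[INET_ADDRSTRLEN] = {};
  if (inet_ntop(AF_INET, &local.sin_addr, buffer, sizeof(buffer)) == nullptr) {
    return "";
  }

  return buffer;
}
```

In the example above, the client would bind to 127.0.1.1 before connecting, so that the reverse lookup on the server side yields 'poincare' rather than 'localhost'.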





[jira] [Created] (MESOS-9859) Deprecate 'libprocess' hostname validation scheme

2019-06-24 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9859:
--

 Summary: Deprecate 'libprocess' hostname validation scheme
 Key: MESOS-9859
 URL: https://issues.apache.org/jira/browse/MESOS-9859
 Project: Mesos
  Issue Type: Task
Reporter: Benno Evers


After https://reviews.apache.org/r/70884 or an equivalent change has landed and 
the `--hostname_validation_scheme` flag has been added to Mesos, we should 
deprecate the `libprocess` setting for the `--hostname_validation_scheme` flag 
in libprocess.





[jira] [Created] (MESOS-9858) Remember hostname for a UPID.

2019-06-24 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9858:
--

 Summary: Remember hostname for a UPID.
 Key: MESOS-9858
 URL: https://issues.apache.org/jira/browse/MESOS-9858
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


When a UPID like `mas...@mesos.example.org` is parsed, we resolve 
`mesos.example.org` and store the resolved IP address in the `address` member 
of the UPID, while the hostname is discarded.

We should remember that hostname. This will serve two purposes:
- First, we can then display it e.g. in the WebUI or in logs, saving the user 
from having to reverse the DNS lookup manually
- The hostname can be used for certificate hostname validation when the UPID is 
remote and we're connecting via TLS to that actor.
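
As a rough illustration of the idea (not the libprocess `UPID` class; `ParsedUpid` and `parseUpid()` are hypothetical, and the `id@host:port` input format is an assumption), the parser could keep the host string as written next to the resolved address:

```cpp
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>

#include <cstdint>
#include <string>

// Hypothetical stand-in for a parsed UPID.
struct ParsedUpid
{
  std::string id;
  std::string hostname; // Host part exactly as written by the user.
  std::string ip;       // Resolved IPv4 address.
  uint16_t port = 0;
};

// Parses `id@host:port`, resolves `host`, and stores both the original
// host string and the resolved IP. Returns false on malformed input or
// if resolution fails.
bool parseUpid(const std::string& text, ParsedUpid* upid)
{
  const std::string::size_type at = text.find('@');
  const std::string::size_type colon = text.rfind(':');
  if (at == std::string::npos || colon == std::string::npos ||
      at == 0 || colon <= at + 1 || colon + 1 == text.size()) {
    return false;
  }

  unsigned long port = 0;
  for (std::string::size_type i = colon + 1; i < text.size(); ++i) {
    if (text[i] < '0' || text[i] > '9') return false;
    port = port * 10 + static_cast<unsigned long>(text[i] - '0');
    if (port > 65535) return false;
  }

  upid->id = text.substr(0, at);
  upid->hostname = text.substr(at + 1, colon - at - 1);
  upid->port = static_cast<uint16_t>(port);

  addrinfo hints = {};
  hints.ai_family = AF_INET;
  hints.ai_socktype = SOCK_STREAM;
  addrinfo* results = nullptr;
  if (getaddrinfo(upid->hostname.c_str(), nullptr, &hints, &results) != 0 ||
      results == nullptr) {
    return false;
  }

  char buffer[INET_ADDRSTRLEN] = {};
  const sockaddr_in* address =
    reinterpret_cast<const sockaddr_in*>(results->ai_addr);
  const char* converted =
    inet_ntop(AF_INET, &address->sin_addr, buffer, sizeof(buffer));
  freeaddrinfo(results);
  if (converted == nullptr) return false;

  upid->ip = buffer;
  return true;
}
```

The retained `hostname` is what would later be shown in logs or handed to the TLS layer for certificate hostname validation, instead of a name obtained through a reverse lookup on `ip`.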





[jira] [Created] (MESOS-9857) Switch default hostname validation in Mesos

2019-06-24 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9857:
--

 Summary: Switch default hostname validation in Mesos
 Key: MESOS-9857
 URL: https://issues.apache.org/jira/browse/MESOS-9857
 Project: Mesos
  Issue Type: Task
Reporter: Benno Evers


After https://reviews.apache.org/r/70795 has landed, we will continue using the 
legacy hostname validation scheme by default, exposing users to increased MitM 
risk and to hangs caused by reverse DNS lookups.

With the next major release, we should change the default to the 'openssl' 
scheme and remove the legacy behaviour.







[jira] [Created] (MESOS-9855) Provide Hooks to Modules for defining hostname validation policy

2019-06-21 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9855:
--

 Summary: Provide Hooks to Modules for defining hostname validation 
policy
 Key: MESOS-9855
 URL: https://issues.apache.org/jira/browse/MESOS-9855
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


Once the changes in MESOS-9809 have landed, operators will be able to 
ensure that libprocess-enabled programs are performing RFC6125-compliant TLS 
hostname validation on all outgoing connections.

However, client hostname validation will not be done since there's no 
*standard* way of doing that. Instead, the application layer should set the 
policy on which certificate fields are considered a valid and accepted proof of 
identity.

In order to do that, we should provide hooks for Mesos modules, so they can 
select the hostname policy for client (and probably also for server) hostname 
validation.






[jira] [Commented] (MESOS-9810) Reject certificate-less ciphers when certificate verification is enabled

2019-06-19 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867726#comment-16867726
 ] 

Benno Evers commented on MESOS-9810:


Review: https://reviews.apache.org/r/70748/

> Reject certificate-less ciphers when certificate verification is enabled
> 
>
> Key: MESOS-9810
> URL: https://issues.apache.org/jira/browse/MESOS-9810
> Project: Mesos
>  Issue Type: Task
>Reporter: Benno Evers
>Priority: Major
>  Labels: foundations
>
> A TLS server is required by the spec to always send a server certificate, 
> unless an anonymous cipher is used.
> In libprocess, this certificate is verified to be valid and trusted when the 
> flag LIBPROCESS_VERIFY_CERT is set to true.
> However, when an anonymous cipher is used, the server does not present a 
> certificate, meaning the verification step will not happen. If a TLS server 
> were allowed to use such a cipher, it could trivially sidestep the 
> security provided by certificate verification.
> Therefore, we should always reject connections using anonymous ciphers when 
> certificate verification is enabled.
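
As an illustration of the rule described in this ticket, the following sketch uses Python's standard `ssl` module instead of libprocess. It is not the patch under review; the helper names `is_anonymous()` and `make_verifying_client_context()` are made up for this example. It relies on OpenSSL's cipher description string, which marks unauthenticated suites with `Au=None`.

```python
import ssl


def is_anonymous(description: str) -> bool:
    """True if an OpenSSL cipher description denotes an unauthenticated suite."""
    return "Au=None" in description


def make_verifying_client_context() -> ssl.SSLContext:
    """Client context that verifies certificates and refuses anonymous ciphers."""
    # create_default_context() enables CERT_REQUIRED and hostname checking.
    ctx = ssl.create_default_context()
    # Explicitly exclude anonymous (aNULL) and unencrypted (eNULL) suites.
    ctx.set_ciphers("DEFAULT:!aNULL:!eNULL")
    leftover = [c["name"] for c in ctx.get_ciphers()
                if is_anonymous(c["description"])]
    if leftover:
        raise RuntimeError("anonymous ciphers still enabled: %s" % leftover)
    return ctx
```

The explicit check after `set_ciphers()` mirrors the intent of the ticket: when verification is on, a configuration that still permits a certificate-less cipher is treated as an error rather than silently accepted.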



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9811) Don't use reverse DNS for hostname validation

2019-06-19 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867724#comment-16867724
 ] 

Benno Evers commented on MESOS-9811:


Review: https://reviews.apache.org/r/70749/

> Don't use reverse DNS for hostname validation
> -
>
> Key: MESOS-9811
> URL: https://issues.apache.org/jira/browse/MESOS-9811
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: foundations, libprocess, ssl
>
> Upon connection we first resolve the hostname and forget about it
> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/http.cpp#L1462-L1504
> then later use reverse DNS on the remote address to get back a hostname
> https://github.com/apache/mesos/blob/4708c2a368e12a89669135f4d0dd05d9b0b2/3rdparty/libprocess/src/posix/libevent/libevent_ssl_socket.cpp#L548-L556
> and verify the server certificate against *that*.
> Instead, we should verify the server certificate against the hostname that 
> was used by the client to initiate the connection.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9809) Use OpenSSL built-in functions for hostname validation

2019-06-19 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867722#comment-16867722
 ] 

Benno Evers commented on MESOS-9809:


Review: https://reviews.apache.org/r/70749/

> Use OpenSSL built-in functions for hostname validation
> --
>
> Key: MESOS-9809
> URL: https://issues.apache.org/jira/browse/MESOS-9809
> Project: Mesos
>  Issue Type: Task
>Reporter: Benno Evers
>Priority: Major
>  Labels: foundations, libprocess, ssl-tls
>
> We traditionally use a hand-written hostname validation algorithm in 
> libprocess that is based on the example code in 
> https://wiki.openssl.org/index.php/Hostname_validation
> However, since OpenSSL 1.1.0, there is a built-in API function, 
> `SSL_set1_host()`, that can be used to let OpenSSL handle hostname validation 
> during the TLS handshake in a standardized manner.
> We should take advantage of this when possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9791) Libprocess does not support server only SSL certificate verification.

2019-06-07 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858670#comment-16858670
 ] 

Benno Evers commented on MESOS-9791:


After some discussion, we noticed that the existing libprocess configuration is 
almost sufficient to achieve configuration (3).

In particular, we created this table of the current libprocess behaviour as of 
1.8.0: 
https://docs.google.com/document/d/1sSzjyJ5odsNgv1BgsDQOopwNbj-ufzAA5rA4ColWXPU/edit

Setting `LIBPROCESS_SSL_VERIFY_CERT=true` and 
`LIBPROCESS_SSL_REQUIRE_CERT=false` will result in the following behaviour:
 - Require valid peer certificate in client mode unless an anonymous cipher is 
used
 - Send certificate in server mode
 - Send certificate in client mode if present
 - Verify client certificate in server mode if present.

After MESOS-9810 has landed, this will *always* require a valid peer certificate 
in client mode, fulfilling the requirements.


Note: With this setting, libprocess will always send a Client Certificate 
Request during the TLS handshake, but that is not as bad as it sounds since the 
TLS protocol specifies that a client MUST respond with an empty certificate 
response if it has no valid certificate to present. The server will then accept 
an empty certificate because `require_cert` was not set.
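
For readers more familiar with other TLS stacks, the combination described above resembles the following setup in Python's standard `ssl` module. This is only an analogy under stated assumptions: `server_context()` and `client_context()` are names invented for this sketch, and the mapping of `require_cert` onto `CERT_OPTIONAL` / `CERT_REQUIRED` is my reading of the behaviour listed above, not libprocess code.

```python
import ssl


def server_context(require_cert: bool) -> ssl.SSLContext:
    """Server side: always request a client certificate, optionally require it."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # CERT_OPTIONAL sends a certificate request, verifies the client
    # certificate if one is presented, and accepts the connection if the
    # client answers with an empty certificate list.
    ctx.verify_mode = ssl.CERT_REQUIRED if require_cert else ssl.CERT_OPTIONAL
    return ctx


def client_context() -> ssl.SSLContext:
    """Client side: a valid server certificate is always required."""
    # PROTOCOL_TLS_CLIENT defaults to CERT_REQUIRED with hostname checking.
    return ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
```

A real server would additionally load its own certificate and key (for example with `load_cert_chain()`) and a set of trusted CAs; those steps are omitted here to keep the sketch focused on the verification modes.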

> Libprocess does not support server only SSL certificate verification.
> -
>
> Key: MESOS-9791
> URL: https://issues.apache.org/jira/browse/MESOS-9791
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Alexander Rukletsov
>Priority: Major
>  Labels: foundations, mesosphere, security, ssl, tls
>
> Currently SSL certificate verification in Libprocess can be configured in the 
> [following 
> ways|https://github.com/apache/mesos/blob/eecb82c77117998af0c67a53c64e9b1e975acfa4/3rdparty/libprocess/src/openssl.cpp#L88-L97]:
> (1) send certificate if in server mode, verify peer certificates *if present*;
> (2) require valid peer certificates in *both* client and server modes.
> It is currently impossible to configure a Libprocess instance to 
> simultaneously:
> (3) require valid peer certificate in client mode and send certificate in 
> server mode.
> Because Libprocess is often used by programs that act both as servers and 
> clients, implementing (3) is necessary to enable the so-called 
> webserver-browser model.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9811) Don't use reverse DNS for hostname validation

2019-06-03 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9811:
--

 Summary: Don't use reverse DNS for hostname validation
 Key: MESOS-9811
 URL: https://issues.apache.org/jira/browse/MESOS-9811
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Upon connection we first resolve the hostname and forget about it

https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/http.cpp#L1462-L1504

then later use reverse DNS on the remote address to get back a hostname

https://github.com/apache/mesos/blob/4708c2a368e12a89669135f4d0dd05d9b0b2/3rdparty/libprocess/src/posix/libevent/libevent_ssl_socket.cpp#L548-L556

and verify the server certificate against *that*.

Instead, we should verify the server certificate against the hostname that was 
used by the client to initiate the connection.
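
The proposed behaviour can be sketched with Python's standard `ssl` module (an illustration only, not libprocess code; `Target` and `wrap_for()` are hypothetical names). The idea is simply to carry the originally requested name alongside the connection and hand that name to the TLS layer, so no reverse lookup of the peer address is ever needed.

```python
import socket
import ssl
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    hostname: str  # the name the caller asked for; kept for verification
    port: int


def wrap_for(target: Target, sock: socket.socket,
             ctx: ssl.SSLContext) -> ssl.SSLSocket:
    """Wrap a socket so the certificate is checked against the requested name."""
    # server_hostname is the name the server certificate is validated
    # against during the handshake (it is also sent as SNI).
    return ctx.wrap_socket(sock, server_hostname=target.hostname)
```

With this shape, address resolution can happen anywhere between creating the `Target` and connecting the socket without affecting which name is verified.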



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9790) Libprocess does not use standard tooling for hostname validation.

2019-06-03 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855123#comment-16855123
 ] 

Benno Evers commented on MESOS-9790:


Since the use of `X509_check_host()` is discouraged by the OpenSSL 
documentation, this will be implemented by MESOS-9809 instead.

> Libprocess does not use standard tooling for hostname validation. 
> --
>
> Key: MESOS-9790
> URL: https://issues.apache.org/jira/browse/MESOS-9790
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Alexander Rukletsov
>Priority: Major
>  Labels: foundations, mesosphere, security, ssl, tls
>
> Libprocess currently uses [custom 
> code|https://github.com/apache/mesos/blob/eecb82c77117998af0c67a53c64e9b1e975acfa4/3rdparty/libprocess/src/openssl.cpp#L755-L863]
>  for hostname validation in its SSL certificate verification workflow. 
> However, OpenSSL provides a function for this, [{{X509_check_host()}} 
> |https://www.openssl.org/docs/manmaster/man3/X509_check_host.html].
> For safety and reliability, we should enable an option to use 
> {{X509_check_host()}} for hostname validation instead of our custom code, but 
> preserve the custom code for backward compatibility.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9809) Use OpenSSL built-in functions for hostname validation

2019-06-03 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9809:
--

 Summary: Use OpenSSL built-in functions for hostname validation
 Key: MESOS-9809
 URL: https://issues.apache.org/jira/browse/MESOS-9809
 Project: Mesos
  Issue Type: Task
Reporter: Benno Evers


We traditionally use a hand-written hostname validation algorithm in libprocess 
that is based on the example code in 
https://wiki.openssl.org/index.php/Hostname_validation

However, since OpenSSL 1.1.0, there is a built-in API function, 
`SSL_set1_host()`, that can be used to let OpenSSL handle hostname validation 
during the TLS handshake in a standardized manner.

We should take advantage of this when possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9797) SSL Ciphersuite settings can break client TLS handshake

2019-05-27 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9797:
--

 Summary: SSL Ciphersuite settings can break client TLS handshake
 Key: MESOS-9797
 URL: https://issues.apache.org/jira/browse/MESOS-9797
 Project: Mesos
  Issue Type: Improvement
 Environment: Ubuntu 18.04 w/ OpenSSL 1.1.0g
Reporter: Benno Evers


Starting a mesos-agent with the following environment variables:

{noformat}
env GLOG_v=2 LIBPROCESS_SSL_ENABLED=true LIBPROCESS_SSL_ENABLE_DOWNGRADE=false 
LIBPROCESS_SSL_VERIFY_CERT=false 
LIBPROCESS_SSL_CERT_FILE=/etc/ssl/certs/ssl-cert-snakeoil.pem 
LIBPROCESS_SSL_KEY_FILE=/etc/ssl/private/ssl-cert-snakeoil.key 
LIBPROCESS_SSL_CIPHERS=ECDHE-PSK-AES128-CBC-SHA mesos-agent 
--work_dir=/tmp/ --master=127.0.1.1:4447 --systemd_enable_support=false
{noformat}

caused the mesos-agent on my machine (using OpenSSL 1.1.0g) to establish a TCP 
connection to the given master but never send a ClientHello message, so the 
TLS handshake failed.

After removing the `LIBPROCESS_SSL_CIPHERS=ECDHE-PSK-AES128-CBC-SHA` variable, 
the agent was able to connect normally.

The reason for this still needs to be investigated.
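
One possible starting point for that investigation is to look at what the local OpenSSL build makes of the cipher string. The sketch below uses Python's standard `ssl` module rather than libprocess, and `expand_cipher_string()` is a name invented here. A hypothesis worth checking, which is not confirmed anywhere in this ticket, is that a PSK suite cannot be offered by a client that has no pre-shared key configured, which would leave nothing to put into the ClientHello.

```python
import ssl


def expand_cipher_string(cipher_string: str) -> list:
    """Return the TLS <= 1.2 cipher names selected for a cipher string."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    try:
        ctx.set_ciphers(cipher_string)
    except ssl.SSLError:
        # OpenSSL could not select any cipher for this string.
        return []
    # TLS 1.3 suites are configured separately from the cipher string,
    # so they are left out of the report.
    return [c["name"] for c in ctx.get_ciphers()
            if c["protocol"] != "TLSv1.3"]
```

Calling `expand_cipher_string("ECDHE-PSK-AES128-CBC-SHA")` on the affected machine shows whether the string selects anything at all; the result depends on the OpenSSL version and build options, so no expected output is given here.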



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9697) Release RPMs are not uploaded to bintray

2019-05-21 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844858#comment-16844858
 ] 

Benno Evers edited comment on MESOS-9697 at 5/21/19 1:55 PM:
-

{noformat}
commit 35fa8420762f1e8cf8df211ddcdaa0cc77104bc2
Author: Benno Evers 
Date:   Fri May 17 11:24:23 2019 +0200

Updated bintray upload scripts to remove hard-coded accounts.

This updates the Jenkinsfile used in the ASF Jenkins as well
as the associated upload script for bintray to make the
used credential id configurable and to upload the built packages
to the official `apache/mesos` bintray account.

Review: https://reviews.apache.org/r/70573/
{noformat}
{noformat}
commit ae23ded6b9b2ec3b2b6ec494408dc3d6e1df (bevers/apache-jenkins-bintray)
Author: Benno Evers 
Date:   Fri May 17 18:46:34 2019 +0200

Rearranged 'Downloads' page and updated bintray URL.

Updated the bintray link on the `Downloads` page to point
to the `apache/mesos` account instead of the `mesos`
account.

In addition, several minor formatting changes were done:

  * Added a space after the colon in the `Getting older
Mesos binaries` section.

  * Moved links to the getting started guide to the top
of the document.

  * Used a list to present the download links to the latest
stable release.

  * Used `` instead of `` tags for the link to the
ASF git repository.

Review: https://reviews.apache.org/r/70526/
{noformat}

The changes in the commit above did return the Apache CI to a state where 
packages are built successfully; however, the bintray upload is still failing 
without any error message.


was (Author: bennoe):
{noformat}
commit 35fa8420762f1e8cf8df211ddcdaa0cc77104bc2
Author: Benno Evers 
Date:   Fri May 17 11:24:23 2019 +0200

Updated bintray upload scripts to remove hard-coded accounts.

This updates the Jenkinsfile used in the ASF Jenkins as well
as the associated upload script for bintray to make the
used credential id configurable and to upload the built packages
to the official `apache/mesos` bintray account.

Review: https://reviews.apache.org/r/70573/
{noformat}

The changes in the commit above did return the Apache CI to a state where 
packages are built successfully; however, the bintray upload is still failing 
without any error message.

> Release RPMs are not uploaded to bintray
> 
>
> Key: MESOS-9697
> URL: https://issues.apache.org/jira/browse/MESOS-9697
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.6.2, 1.7.2, 1.8.0
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Critical
>  Labels: foundations, integration, jenkins, packaging, rpm
>
> While we currently build release RPMs, e.g., 
> [https://builds.apache.org/view/M-R/view/Mesos/job/Packaging/job/CentOS/job/1.7.x/],
>  these artifacts are not uploaded to bintray. Due to that RPM links on the 
> downloads page [http://mesos.apache.org/downloads/] are broken.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9697) Release RPMs are not uploaded to bintray

2019-05-21 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844858#comment-16844858
 ] 

Benno Evers commented on MESOS-9697:


{noformat}
commit 35fa8420762f1e8cf8df211ddcdaa0cc77104bc2
Author: Benno Evers 
Date:   Fri May 17 11:24:23 2019 +0200

Updated bintray upload scripts to remove hard-coded accounts.

This updates the Jenkinsfile used in the ASF Jenkins as well
as the associated upload script for bintray to make the
used credential id configurable and to upload the built packages
to the official `apache/mesos` bintray account.

Review: https://reviews.apache.org/r/70573/
{noformat}

The changes in the commit above did return the Apache CI to a state where 
packages are built successfully; however, the bintray upload is still failing 
without any error message.

> Release RPMs are not uploaded to bintray
> 
>
> Key: MESOS-9697
> URL: https://issues.apache.org/jira/browse/MESOS-9697
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.6.2, 1.7.2, 1.8.0
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Critical
>  Labels: foundations, integration, jenkins, packaging, rpm
>
> While we currently build release RPMs, e.g., 
> [https://builds.apache.org/view/M-R/view/Mesos/job/Packaging/job/CentOS/job/1.7.x/],
>  these artifacts are not uploaded to bintray. Due to that RPM links on the 
> downloads page [http://mesos.apache.org/downloads/] are broken.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9329) CMake build on Fedora 28 fails due to libevent error

2019-05-20 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839483#comment-16839483
 ] 

Benno Evers edited comment on MESOS-9329 at 5/20/19 11:53 AM:
--

Indeed, the autotools build uses an older version of libevent, 
[2.0.22|https://github.com/apache/mesos/blob/a9a2acabd03181865055b77cf81e7bb310b236d6/3rdparty/libevent-2.0.22-stable.tar.gz].
 We can't easily use it in the cmake build because older versions do not 
support cmake, see MESOS-3529. Bottom line is: a cmake build on Linux with ssl 
and libevent enabled is currently not supported.


was (Author: alexr):
Indeed, the autotools build uses a newer version of libevent, 
[2.0.22|https://github.com/apache/mesos/blob/a9a2acabd03181865055b77cf81e7bb310b236d6/3rdparty/libevent-2.0.22-stable.tar.gz].
 We can't easily use it in the cmake build because newer versions do not 
support cmake, see MESOS-3529. Bottom line is: a cmake build on Linux with ssl 
and libevent enabled is currently not supported.

> CMake build on Fedora 28 fails due to libevent error
> 
>
> Key: MESOS-9329
> URL: https://issues.apache.org/jira/browse/MESOS-9329
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>
> Trying to build Mesos using cmake with the options 
> {noformat}
> cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_SSL=1 -DENABLE_LIBEVENT=1
> {noformat}
> fails due to the following:
> {noformat}
> [  1%] Building C object CMakeFiles/event_extra.dir/bufferevent_openssl.c.o
> /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:
>  In function ‘bio_bufferevent_new’:
> /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:112:3:
>  error: dereferencing pointer to incomplete type ‘BIO’ {aka ‘struct bio_st’}
>   b->init = 0;
>^~
> /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:
>  At top level:
> /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:234:1:
>  error: variable ‘methods_bufferevent’ has initializer but incomplete type
>  static BIO_METHOD methods_bufferevent = {
> [...]
> {noformat}
> Since the autotools build does not have issues when enabling libevent and 
> ssl, it seems most likely that the `libevent-2.1.5-beta` version used by 
> default in the cmake build is somehow connected to the error message.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9677) RPM packages should be built with launcher sealing

2019-05-14 Thread Benno Evers (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9677:
--

  Resolution: Fixed
Assignee: Benno Evers
Target Version/s:   (was: 1.8.0)

> RPM packages should be built with launcher sealing
> --
>
> Key: MESOS-9677
> URL: https://issues.apache.org/jira/browse/MESOS-9677
> Project: Mesos
>  Issue Type: Task
>  Components: build
>Affects Versions: 1.8.0
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Major
>  Labels: integration, mesosphere, packaging, rpm
> Fix For: 1.8.1
>
>
> We should consider enabling launcher sealing in the Mesos RPM packages. Since 
> this feature is built conditionally, it is hard to write e.g., module code 
> against Mesos packages since required functions might be missing (e.g., 
> [https://github.com/dcos/dcos-mesos-modules/commit/8ce70e6cc789054831daa3058647e326b2b11bc9]
>  cannot be linked against the default RPM package anymore). The RPM's target 
> platform centos7 should include a recent enough kernel for this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9677) RPM packages should be built with launcher sealing

2019-05-14 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839611#comment-16839611
 ] 

Benno Evers commented on MESOS-9677:


master:
{noformat}
commit 7ff4263c371f8a51551199c694837cf371923137
Author: Benjamin Bannier 
Date:   Mon Apr 8 14:37:32 2019 +0200

Enabled launcher sealing for RPM packages.

We enable this flag since with it disabled certain public functions
are not available making it hard to e.g., write modules against this
version of Mesos.

While launcher sealing depends on a recent kernel, the platform we
build RPMs for already satisfies the requirements.

Review: https://reviews.apache.org/r/70295
{noformat}

1.8.x:
{noformat}
commit 3bc6082afe75390dc3b0abd58d6ce85827709b89 (origin/1.8.x, 1.8.x)
Author: Benjamin Bannier 
Date:   Mon Apr 8 14:37:32 2019 +0200

Enabled launcher sealing for RPM packages.

We enable this flag since with it disabled certain public functions
are not available making it hard to e.g., write modules against this
version of Mesos.

While launcher sealing depends on a recent kernel, the platform we
build RPMs for already satisfies the requirements.

Review: https://reviews.apache.org/r/70295
{noformat}

> RPM packages should be built with launcher sealing
> --
>
> Key: MESOS-9677
> URL: https://issues.apache.org/jira/browse/MESOS-9677
> Project: Mesos
>  Issue Type: Task
>  Components: build
>Affects Versions: 1.8.0
>Reporter: Benjamin Bannier
>Priority: Major
>  Labels: integration, mesosphere, packaging, rpm
>
> We should consider enabling launcher sealing in the Mesos RPM packages. Since 
> this feature is built conditionally, it is hard to write e.g., module code 
> against Mesos packages since required functions might be missing (e.g., 
> [https://github.com/dcos/dcos-mesos-modules/commit/8ce70e6cc789054831daa3058647e326b2b11bc9]
>  cannot be linked against the default RPM package anymore). The RPM's target 
> platform centos7 should include a recent enough kernel for this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9783) Centos 6 RPM build is broken on Apache CI

2019-05-14 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839310#comment-16839310
 ] 

Benno Evers commented on MESOS-9783:


master:
{noformat}
commit a9a2acabd03181865055b77cf81e7bb310b236d6
Author: Benno Evers 
Date:   Tue May 14 11:52:59 2019 +0200

Updated URL in CentOS 6 Dockerfile.

The link was pointing to a rpm package that was apparently
replaced on the upstream file server.

Review: https://reviews.apache.org/r/70639
{noformat}

1.8.x:
{noformat}
commit 5ca16bfeae19c193f4e67390543d08897a0f4ab8
Author: Benno Evers 
Date:   Tue May 14 11:52:59 2019 +0200

Updated URL in CentOS 6 Dockerfile.

The link was pointing to a rpm package that was apparently
replaced on the upstream file server.

Review: https://reviews.apache.org/r/70639
{noformat}

1.7.x:
{noformat}
commit cfc7e6e9905329460d182150a91317b3e0a75157
Author: Benno Evers 
Date:   Tue May 14 11:52:59 2019 +0200

Updated URL in CentOS 6 Dockerfile.

The link was pointing to a rpm package that was apparently
replaced on the upstream file server.

Review: https://reviews.apache.org/r/70639
{noformat}

> Centos 6 RPM build is broken on Apache CI
> -
>
> Key: MESOS-9783
> URL: https://issues.apache.org/jira/browse/MESOS-9783
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>  Labels: foundations
>
> The CentOS 6 RPM build on the Apache CI at `builds.apache.org` has been broken 
> since April 16, as it fails on the following step:
> {noformat}
> RUN  rpm -Uvh --replacepkgs \
>   
> http://yum.postgresql.org/9.5/redhat/rhel-6-x86_64/pgdg-centos95-9.5-2.noarch.rpm
> {noformat}
> The URL returns a 404 response because the package was removed from the 
> upstream fileserver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9783) Centos 6 RPM build is broken on Apache CI

2019-05-14 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839282#comment-16839282
 ] 

Benno Evers commented on MESOS-9783:


https://reviews.apache.org/r/70639/diff/1#index_header

> Centos 6 RPM build is broken on Apache CI
> -
>
> Key: MESOS-9783
> URL: https://issues.apache.org/jira/browse/MESOS-9783
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>  Labels: foundations
>
> The CentOS 6 RPM build on the Apache CI at `builds.apache.org` has been broken 
> since April 16, as it fails on the following step:
> {noformat}
> RUN  rpm -Uvh --replacepkgs \
>   
> http://yum.postgresql.org/9.5/redhat/rhel-6-x86_64/pgdg-centos95-9.5-2.noarch.rpm
> {noformat}
> The URL returns a 404 response because the package was removed from the 
> upstream fileserver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9783) Centos 6 RPM build is broken on Apache CI

2019-05-14 Thread Benno Evers (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9783:
--

Assignee: Benno Evers

> Centos 6 RPM build is broken on Apache CI
> -
>
> Key: MESOS-9783
> URL: https://issues.apache.org/jira/browse/MESOS-9783
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>  Labels: foundations
>
> The CentOS 6 RPM build on the Apache CI at `builds.apache.org` has been broken 
> since April 16, as it fails on the following step:
> {noformat}
> RUN  rpm -Uvh --replacepkgs \
>   
> http://yum.postgresql.org/9.5/redhat/rhel-6-x86_64/pgdg-centos95-9.5-2.noarch.rpm
> {noformat}
> The URL returns a 404 response because the package was removed from the 
> upstream fileserver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9783) Centos 6 RPM build is broken on Apache CI

2019-05-14 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9783:
--

 Summary: Centos 6 RPM build is broken on Apache CI
 Key: MESOS-9783
 URL: https://issues.apache.org/jira/browse/MESOS-9783
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


The CentOS 6 RPM build on the Apache CI at `builds.apache.org` has been broken 
since April 16, as it fails on the following step:

{noformat}
RUN  rpm -Uvh --replacepkgs \
  
http://yum.postgresql.org/9.5/redhat/rhel-6-x86_64/pgdg-centos95-9.5-2.noarch.rpm
{noformat}

The URL returns a 404 response because the package was removed from the 
upstream fileserver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9776) Mention removal of *.json endpoints in 1.8.0 CHANGELOG

2019-05-08 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9776:
--

 Summary: Mention removal of *.json endpoints in 1.8.0 CHANGELOG
 Key: MESOS-9776
 URL: https://issues.apache.org/jira/browse/MESOS-9776
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


We should mention in the CHANGELOG and update notes that the *.json endpoints, 
which were deprecated in Mesos 0.25, were actually removed in Mesos 1.8.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

