[jira] [Commented] (MESOS-10090) Mesos build on Windows appears to be broken.

2020-01-27 Thread Joseph Wu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024539#comment-17024539
 ] 

Joseph Wu commented on MESOS-10090:
---

I was planning to fix this in the SSL downgrade chain (up for review): 
https://reviews.apache.org/r/72018/

> Mesos build on Windows appears to be broken.
> 
>
> Key: MESOS-10090
> URL: https://issues.apache.org/jira/browse/MESOS-10090
> Project: Mesos
>  Issue Type: Bug
> Environment: Windows, MSVC
>Reporter: Till Toenshoff
>Priority: Blocker
>
> I was told that when trying to build the latest Mesos (master - 1.10 WIP), 
> MSVC complains about our use of domain sockets;
> {noformat}
> mesos\src\slave/slave.hpp(133,40): error C3083: ‘unix’: the symbol to the 
> left of a ‘::’ must be a type
> mesos\src\slave/slave.hpp(877,28): error C3083: ‘unix’: the symbol to the 
> left of a ‘::’ must be a type
> \mesos\src\slave\slave.cpp(203,45): error C3083: ‘unix’: the symbol to the 
> left of a ‘::’ must be a type
> mesos\3rdparty\libprocess\src\http.cpp(1628,18): 
> error C3083: ‘unix’: the symbol to the left of a ‘::’ must be a type
> mesos\3rdparty\libprocess\src\http.cpp(1629,16): error C3083: ‘unix’: the 
> symbol to the left of a ‘::’ must be a type {noformat}
> This entirely prevents building with MSVC, and no workaround is known; 
> declaring this a blocker for those reasons.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Deleted] (MESOS-10012) Implement SSL socket downgrading on the native Windows SSL socket.

2020-01-23 Thread Joseph Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu deleted MESOS-10012:
--


> Implement SSL socket downgrading on the native Windows SSL socket.
> --
>
> Key: MESOS-10012
> URL: https://issues.apache.org/jira/browse/MESOS-10012
> Project: Mesos
>  Issue Type: Task
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>Priority: Minor
>  Labels: foundations
>
> The logic needed to determine whether a connection is SSL or not is already 
> established in the libevent SSL socket:
> {code}
>   // Based on the function 'ssl23_get_client_hello' in openssl, we
>   // test whether to dispatch to the SSL or non-SSL based accept based
>   // on the following rules:
>   //   1. If there are fewer than 3 bytes: non-SSL.
>   //   2. If the 1st bit of the 1st byte is set AND the 3rd byte is
>   //  equal to SSL2_MT_CLIENT_HELLO: SSL.
>   //   3. If the 1st byte is equal to SSL3_RT_HANDSHAKE AND the 2nd
>   //  byte is equal to SSL3_VERSION_MAJOR and the 6th byte is
>   //  equal to SSL3_MT_CLIENT_HELLO: SSL.
>   //   4. Otherwise: non-SSL.
>   // For an ascii based protocol to falsely get dispatched to SSL it
>   // needs to:
>   //   1. Start with an invalid ascii character (0x80).
>   //   2. OR have the first 2 characters be a SYN followed by ETX, and
>   //  then the 6th character be SOH.
>   // These conditions clearly do not constitute valid HTTP requests,
>   // and are unlikely to collide with other existing protocols.
>   bool ssl = false; // Default to rule 4.
>   if (size < 2) { // Rule 1.
> ssl = false;
>   } else if ((data[0] & 0x80) && data[2] == SSL2_MT_CLIENT_HELLO) { // Rule 2.
> ssl = true;
>   } else if (data[0] == SSL3_RT_HANDSHAKE &&
>  data[1] == SSL3_VERSION_MAJOR &&
>  data[5] == SSL3_MT_CLIENT_HELLO) { // Rule 3.
> ssl = true;
>   }
> {code}
> This only requires us to peek at the first 6 bytes of data.  One possible 
> complication is that Overlapped sockets do not support peeking.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10073) Implement SSL downgrade on the native SSL socket

2020-01-13 Thread Joseph Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-10073:
-

Assignee: Joseph Wu

> Implement SSL downgrade on the native SSL socket
> 
>
> Key: MESOS-10073
> URL: https://issues.apache.org/jira/browse/MESOS-10073
> Project: Mesos
>  Issue Type: Task
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>Priority: Minor
>  Labels: foundations, ssl
>
> The new SSL socket implementation (the non-libevent one) does not currently 
> implement the SSL downgrade hack.  We could probably use {{peek}} to achieve 
> the same result, or modify our socket BIO to look at the first few bytes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10073) Implement SSL downgrade on the native SSL socket

2019-12-18 Thread Joseph Wu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999554#comment-16999554
 ] 

Joseph Wu commented on MESOS-10073:
---

This is just the guard against using the feature tracked by this JIRA:
{code}
commit 34bac34419ebec8441e69d3a5684381468352399
Author: Joseph Wu 
Date:   Tue Dec 17 15:23:27 2019 -0800

SSL Socket: Guarded against downgrade while unimplemented.

The SSL downgrade feature present in our libevent-SSL socket
is currently not supported on the plain-OpenSSL socket.

For this reason, we make sure to check the related flag and
prevent the related tests from running.

Review: https://reviews.apache.org/r/71923
{code}

> Implement SSL downgrade on the native SSL socket
> 
>
> Key: MESOS-10073
> URL: https://issues.apache.org/jira/browse/MESOS-10073
> Project: Mesos
>  Issue Type: Task
>Reporter: Joseph Wu
>Priority: Minor
>  Labels: foundations, ssl
>
> The new SSL socket implementation (the non-libevent one) does not currently 
> implement the SSL downgrade hack.  We could probably use {{peek}} to achieve 
> the same result, or modify our socket BIO to look at the first few bytes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10073) Implement SSL downgrade on the native SSL socket

2019-12-17 Thread Joseph Wu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998672#comment-16998672
 ] 

Joseph Wu commented on MESOS-10073:
---

Review guarding against use of this feature until we implement the downgrade:
https://reviews.apache.org/r/71923/

> Implement SSL downgrade on the native SSL socket
> 
>
> Key: MESOS-10073
> URL: https://issues.apache.org/jira/browse/MESOS-10073
> Project: Mesos
>  Issue Type: Task
>Reporter: Joseph Wu
>Priority: Minor
>  Labels: foundations, ssl
>
> The new SSL socket implementation (the non-libevent one) does not currently 
> implement the SSL downgrade hack.  We could probably use {{peek}} to achieve 
> the same result, or modify our socket BIO to look at the first few bytes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (MESOS-10072) Windows: Curl requires zlib when built with SSL support on Windows

2019-12-17 Thread Joseph Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-10072:
--
Comment: was deleted

(was: Review guarding against usage of this feature, until we implement 
downgrade:
https://reviews.apache.org/r/71923/)

> Windows: Curl requires zlib when built with SSL support on Windows
> --
>
> Key: MESOS-10072
> URL: https://issues.apache.org/jira/browse/MESOS-10072
> Project: Mesos
>  Issue Type: Task
>Reporter: Joseph Wu
>Priority: Major
>  Labels: curl, foundations, windows
> Attachments: Screen Shot 2019-12-17 at 1.38.43 PM.png
>
>
> After building Windows with --enable-ssl, some curl-related tests, like 
> health check tests, start failing with the odd exit code {{-1073741515}}.
> Running curl directly with the Visual Studio debugger yields this error:
>  !Screen Shot 2019-12-17 at 1.38.43 PM.png|width=343,height=164!
> Some documentation online seems to support this additional requirement:
>  [https://wiki.dlang.org/Curl_on_Windows]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10072) Windows: Curl requires zlib when built with SSL support on Windows

2019-12-17 Thread Joseph Wu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998671#comment-16998671
 ] 

Joseph Wu commented on MESOS-10072:
---

Review guarding against use of this feature until we implement the downgrade:
https://reviews.apache.org/r/71923/

> Windows: Curl requires zlib when built with SSL support on Windows
> --
>
> Key: MESOS-10072
> URL: https://issues.apache.org/jira/browse/MESOS-10072
> Project: Mesos
>  Issue Type: Task
>Reporter: Joseph Wu
>Priority: Major
>  Labels: curl, foundations, windows
> Attachments: Screen Shot 2019-12-17 at 1.38.43 PM.png
>
>
> After building Windows with --enable-ssl, some curl-related tests, like 
> health check tests, start failing with the odd exit code {{-1073741515}}.
> Running curl directly with the Visual Studio debugger yields this error:
>  !Screen Shot 2019-12-17 at 1.38.43 PM.png|width=343,height=164!
> Some documentation online seems to support this additional requirement:
>  [https://wiki.dlang.org/Curl_on_Windows]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10073) Implement SSL downgrade on the native SSL socket

2019-12-17 Thread Joseph Wu (Jira)
Joseph Wu created MESOS-10073:
-

 Summary: Implement SSL downgrade on the native SSL socket
 Key: MESOS-10073
 URL: https://issues.apache.org/jira/browse/MESOS-10073
 Project: Mesos
  Issue Type: Task
Reporter: Joseph Wu


The new SSL socket implementation (the non-libevent one) does not currently 
implement the SSL downgrade hack.  We could probably use {{peek}} to achieve 
the same result, or modify our socket BIO to look at the first few bytes.
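
A minimal sketch of the peek-based approach on a POSIX socket, for reference
(this is an assumption about how it could look, not existing Mesos code;
{{isSSLHandshake}} is a hypothetical helper applying the byte-inspection rules
quoted in MESOS-10012):
{code}
#include <sys/socket.h>
#include <sys/types.h>

// Hypothetical helper implementing the ClientHello detection rules.
bool isSSLHandshake(const char* data, size_t size);

// Returns true if the pending bytes on `fd` look like an SSL/TLS
// ClientHello.  MSG_PEEK leaves the bytes in the kernel buffer, so
// whichever accept path we dispatch to still sees the full stream.
// (POSIX only; overlapped Windows sockets cannot peek.)
bool peekForSSL(int fd)
{
  char buffer[6];
  ssize_t peeked = ::recv(fd, buffer, sizeof(buffer), MSG_PEEK);

  return peeked > 0 && isSSLHandshake(buffer, static_cast<size_t>(peeked));
}
{code}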



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10072) Windows: Curl requires zlib when built with SSL support on Windows

2019-12-17 Thread Joseph Wu (Jira)
Joseph Wu created MESOS-10072:
-

 Summary: Windows: Curl requires zlib when built with SSL support 
on Windows
 Key: MESOS-10072
 URL: https://issues.apache.org/jira/browse/MESOS-10072
 Project: Mesos
  Issue Type: Task
Reporter: Joseph Wu
 Attachments: Screen Shot 2019-12-17 at 1.38.43 PM.png

After building Windows with --enable-ssl, some curl-related tests, like health 
check tests, start failing with the odd exit code {{-1073741515}}.

Running curl directly with the Visual Studio debugger yields this error:
 !Screen Shot 2019-12-17 at 1.38.43 PM.png! 

Some documentation online seems to support this additional requirement:
https://wiki.dlang.org/Curl_on_Windows



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10071) Add a timeout on SSL listening sockets for sockets that never complete handshaking

2019-12-16 Thread Joseph Wu (Jira)
Joseph Wu created MESOS-10071:
-

 Summary: Add a timeout on SSL listening sockets for sockets that 
never complete handshaking
 Key: MESOS-10071
 URL: https://issues.apache.org/jira/browse/MESOS-10071
 Project: Mesos
  Issue Type: Task
Reporter: Joseph Wu


Right now, if a plain socket makes a connection to an SSL server socket, but 
the plain socket never transmits any data, the server side will keep the 
connection open indefinitely.  We should consider adding a timeout (or other 
limit) to prevent a build-up of invalid sockets.
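
One way this could look with existing libprocess primitives (a sketch under
assumptions: {{acceptAndHandshake()}} and the 30-second value are placeholders,
not actual code):
{code}
// Discard the pending accept if the peer never finishes the SSL
// handshake, so half-open connections do not accumulate forever.
Future<std::shared_ptr<SocketImpl>> accepted = acceptAndHandshake()
  .after(Seconds(30), [](Future<std::shared_ptr<SocketImpl>> future) {
    // The peer connected but never completed (or never started) the
    // SSL handshake; discard the pending work and surface a failure.
    future.discard();
    return Failure("Timed out waiting for the SSL handshake");
  });
{code}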



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10070) Add a unit test for client-initiated SSL renegotiation

2019-12-16 Thread Joseph Wu (Jira)
Joseph Wu created MESOS-10070:
-

 Summary: Add a unit test for client-initiated SSL renegotiation
 Key: MESOS-10070
 URL: https://issues.apache.org/jira/browse/MESOS-10070
 Project: Mesos
  Issue Type: Task
Reporter: Joseph Wu


https://www.openssl.org/docs/man1.1.1/man3/SSL_renegotiate.html

On certain versions of TLS, the client can attempt to renegotiate an existing 
SSL connection at any time.  This basically means performing an SSL handshake 
again on the same connection.

To ensure our sockets don't break when this happens, we should add a unit test 
exercising the case.
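
A rough sketch of what the client side of such a test might drive, using plain
OpenSSL calls (assuming an already-established {{SSL*}} on a blocking socket,
and TLS 1.2 or lower, since TLS 1.3 removed renegotiation):
{code}
#include <openssl/ssl.h>

// Queue a renegotiation and then drive the new handshake on the
// existing connection.  Returns true if the handshake completes.
bool renegotiate(SSL* ssl)
{
  if (SSL_renegotiate(ssl) != 1) {
    return false;
  }

  // On a non-blocking socket this call would need to be repeated while
  // SSL_get_error() reports SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE.
  return SSL_do_handshake(ssl) == 1;
}
{code}
The interesting assertion would be that pending {{send}}/{{recv}} futures on
both ends of the connection still complete normally afterwards.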



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10069) Consider modifying Process_BENCHMARK_ClientServer to compare socket performance on the new OpenSSL socket implementation

2019-12-16 Thread Joseph Wu (Jira)
Joseph Wu created MESOS-10069:
-

 Summary: Consider modifying Process_BENCHMARK_ClientServer to 
compare socket performance on the new OpenSSL socket implementation
 Key: MESOS-10069
 URL: https://issues.apache.org/jira/browse/MESOS-10069
 Project: Mesos
  Issue Type: Task
Reporter: Joseph Wu


There is a pre-existing benchmark in the libprocess benchmarks.cpp file called 
{{Process_BENCHMARK_ClientServer}}.  We could have this benchmark make HTTPS 
connections en masse as well, to compare performance across the different 
implementations of our sockets.

We will have the following implementations:
* Libevent + OpenSSL (Posix & Windows)
* Libev + OpenSSL (Posix)
* Native Windows event loop + OpenSSL

Since the new OpenSSL socket offloads work onto libprocess worker threads, it 
will be interesting to see whether performance improves.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10010) Implement an SSL socket for Windows, using OpenSSL directly

2019-10-23 Thread Joseph Wu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958389#comment-16958389
 ] 

Joseph Wu commented on MESOS-10010:
---

Reviews will start here: 
https://reviews.apache.org/r/71665/

So far, this only implements the {{connect}} function.

> Implement an SSL socket for Windows, using OpenSSL directly
> ---
>
> Key: MESOS-10010
> URL: https://issues.apache.org/jira/browse/MESOS-10010
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations
>
> {code}
> class WindowsSSLSocketImpl : public SocketImpl
> {
> public:
>   // This will be the entry point for Socket::create(SSL).
>   static Try<std::shared_ptr<SocketImpl>> create(int_fd s);
>   WindowsSSLSocketImpl(int_fd _s);
>   ~WindowsSSLSocketImpl() override;
>   // Overrides for the 'SocketImpl' interface below.
>   // Unreachable.
>   Future<Nothing> connect(const Address& address) override;
>   // This will initialize SSL objects then call windows::connect()
>   // and chain that onto the appropriate call to SSL_do_handshake.
>   Future<Nothing> connect(
>       const Address& address,
>       const openssl::TLSClientConfig& config) override;
>   // These will call SSL_read or SSL_write as appropriate.
>   // As long as the SSL context is set up correctly, these will be
>   // thin wrappers.  (More details after the code block.)
>   Future<size_t> recv(char* data, size_t size) override;
>   Future<size_t> send(const char* data, size_t size) override;
>   Future<size_t> sendfile(int_fd fd, off_t offset, size_t size) override;
>   // Nothing SSL here, just a plain old listener.
>   Try<Nothing> listen(int backlog) override;
>   // This will initialize SSL objects then call windows::accept()
>   // and then perform handshaking.  Any downgrading will
>   // happen here.  Since we control the event loop, we can
>   // easily peek at the first few bytes to check SSL-ness.
>   Future<std::shared_ptr<SocketImpl>> accept() override;
>   SocketImpl::Kind kind() const override { return SocketImpl::Kind::SSL; }
> };
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10012) Implement SSL socket downgrading on the native Windows SSL socket.

2019-10-16 Thread Joseph Wu (Jira)
Joseph Wu created MESOS-10012:
-

 Summary: Implement SSL socket downgrading on the native Windows 
SSL socket.
 Key: MESOS-10012
 URL: https://issues.apache.org/jira/browse/MESOS-10012
 Project: Mesos
  Issue Type: Task
  Components: libprocess
Reporter: Joseph Wu
Assignee: Joseph Wu


The logic needed to determine whether a connection is SSL or not is already 
established in the libevent SSL socket:
{code}
  // Based on the function 'ssl23_get_client_hello' in openssl, we
  // test whether to dispatch to the SSL or non-SSL based accept based
  // on the following rules:
  //   1. If there are fewer than 3 bytes: non-SSL.
  //   2. If the 1st bit of the 1st byte is set AND the 3rd byte is
  //  equal to SSL2_MT_CLIENT_HELLO: SSL.
  //   3. If the 1st byte is equal to SSL3_RT_HANDSHAKE AND the 2nd
  //  byte is equal to SSL3_VERSION_MAJOR and the 6th byte is
  //  equal to SSL3_MT_CLIENT_HELLO: SSL.
  //   4. Otherwise: non-SSL.

  // For an ascii based protocol to falsely get dispatched to SSL it
  // needs to:
  //   1. Start with an invalid ascii character (0x80).
  //   2. OR have the first 2 characters be a SYN followed by ETX, and
  //  then the 6th character be SOH.
  // These conditions clearly do not constitute valid HTTP requests,
  // and are unlikely to collide with other existing protocols.

  bool ssl = false; // Default to rule 4.

  if (size < 2) { // Rule 1.
ssl = false;
  } else if ((data[0] & 0x80) && data[2] == SSL2_MT_CLIENT_HELLO) { // Rule 2.
ssl = true;
  } else if (data[0] == SSL3_RT_HANDSHAKE &&
 data[1] == SSL3_VERSION_MAJOR &&
 data[5] == SSL3_MT_CLIENT_HELLO) { // Rule 3.
ssl = true;
  }
{code}

This only requires us to peek at the first 6 bytes of data.  One possible 
complication is that Overlapped sockets do not support peeking.
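
For reference, the rules above boil down to a small predicate over whatever
prefix of the stream has already been read into a buffer (a sketch; it
conservatively requires all 6 bytes before matching, which is slightly
stricter than the snippet above):
{code}
#include <cstddef>

#include <openssl/ssl.h>

// Classify a buffered prefix of the stream as SSL or non-SSL, using the
// same constants as the libevent socket's downgrade logic.
bool looksLikeSSL(const unsigned char* data, size_t size)
{
  if (size < 6) {
    return false; // Rule 1: too few bytes, treat as non-SSL.
  }

  if ((data[0] & 0x80) && data[2] == SSL2_MT_CLIENT_HELLO) {
    return true; // Rule 2: SSLv2-style ClientHello.
  }

  if (data[0] == SSL3_RT_HANDSHAKE &&
      data[1] == SSL3_VERSION_MAJOR &&
      data[5] == SSL3_MT_CLIENT_HELLO) {
    return true; // Rule 3: SSLv3/TLS ClientHello.
  }

  return false; // Rule 4: anything else is non-SSL.
}
{code}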



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (MESOS-10010) Implement an SSL socket for Windows, using OpenSSL directly

2019-10-16 Thread Joseph Wu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953113#comment-16953113
 ] 

Joseph Wu edited comment on MESOS-10010 at 10/16/19 7:02 PM:
-

Once the BIO (MESOS-10009) is complete, this part will boil down to 
implementing the SSL handshake (i.e. putting {{SSL_do_handshake}} in the right 
places).  The SSL downgrade feature is probably the only part I will split out 
into a separate ticket.


was (Author: kaysoky):
Once the BIO (MESOS-10009) is complete, this part will boil down to 
implementing the SSL handshake.  I can probably only split out the SSL 
downgrade feature into a separate ticket.

> Implement an SSL socket for Windows, using OpenSSL directly
> ---
>
> Key: MESOS-10010
> URL: https://issues.apache.org/jira/browse/MESOS-10010
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations
>
> {code}
> class WindowsSSLSocketImpl : public SocketImpl
> {
> public:
>   // This will be the entry point for Socket::create(SSL).
>   static Try<std::shared_ptr<SocketImpl>> create(int_fd s);
>   WindowsSSLSocketImpl(int_fd _s);
>   ~WindowsSSLSocketImpl() override;
>   // Overrides for the 'SocketImpl' interface below.
>   // Unreachable.
>   Future<Nothing> connect(const Address& address) override;
>   // This will initialize SSL objects then call windows::connect()
>   // and chain that onto the appropriate call to SSL_do_handshake.
>   Future<Nothing> connect(
>       const Address& address,
>       const openssl::TLSClientConfig& config) override;
>   // These will call SSL_read or SSL_write as appropriate.
>   // As long as the SSL context is set up correctly, these will be
>   // thin wrappers.  (More details after the code block.)
>   Future<size_t> recv(char* data, size_t size) override;
>   Future<size_t> send(const char* data, size_t size) override;
>   Future<size_t> sendfile(int_fd fd, off_t offset, size_t size) override;
>   // Nothing SSL here, just a plain old listener.
>   Try<Nothing> listen(int backlog) override;
>   // This will initialize SSL objects then call windows::accept()
>   // and then perform handshaking.  Any downgrading will
>   // happen here.  Since we control the event loop, we can
>   // easily peek at the first few bytes to check SSL-ness.
>   Future<std::shared_ptr<SocketImpl>> accept() override;
>   SocketImpl::Kind kind() const override { return SocketImpl::Kind::SSL; }
> };
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10010) Implement an SSL socket for Windows, using OpenSSL directly

2019-10-16 Thread Joseph Wu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953113#comment-16953113
 ] 

Joseph Wu commented on MESOS-10010:
---

Once the BIO (MESOS-10009) is complete, this part will boil down to 
implementing the SSL handshake.  The SSL downgrade feature is probably the 
only part I will split out into a separate ticket.

> Implement an SSL socket for Windows, using OpenSSL directly
> ---
>
> Key: MESOS-10010
> URL: https://issues.apache.org/jira/browse/MESOS-10010
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations
>
> {code}
> class WindowsSSLSocketImpl : public SocketImpl
> {
> public:
>   // This will be the entry point for Socket::create(SSL).
>   static Try<std::shared_ptr<SocketImpl>> create(int_fd s);
>   WindowsSSLSocketImpl(int_fd _s);
>   ~WindowsSSLSocketImpl() override;
>   // Overrides for the 'SocketImpl' interface below.
>   // Unreachable.
>   Future<Nothing> connect(const Address& address) override;
>   // This will initialize SSL objects then call windows::connect()
>   // and chain that onto the appropriate call to SSL_do_handshake.
>   Future<Nothing> connect(
>       const Address& address,
>       const openssl::TLSClientConfig& config) override;
>   // These will call SSL_read or SSL_write as appropriate.
>   // As long as the SSL context is set up correctly, these will be
>   // thin wrappers.  (More details after the code block.)
>   Future<size_t> recv(char* data, size_t size) override;
>   Future<size_t> send(const char* data, size_t size) override;
>   Future<size_t> sendfile(int_fd fd, off_t offset, size_t size) override;
>   // Nothing SSL here, just a plain old listener.
>   Try<Nothing> listen(int backlog) override;
>   // This will initialize SSL objects then call windows::accept()
>   // and then perform handshaking.  Any downgrading will
>   // happen here.  Since we control the event loop, we can
>   // easily peek at the first few bytes to check SSL-ness.
>   Future<std::shared_ptr<SocketImpl>> accept() override;
>   SocketImpl::Kind kind() const override { return SocketImpl::Kind::SSL; }
> };
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10010) Implement an SSL socket for Windows, using OpenSSL directly

2019-10-09 Thread Joseph Wu (Jira)
Joseph Wu created MESOS-10010:
-

 Summary: Implement an SSL socket for Windows, using OpenSSL 
directly
 Key: MESOS-10010
 URL: https://issues.apache.org/jira/browse/MESOS-10010
 Project: Mesos
  Issue Type: Task
  Components: libprocess
Reporter: Joseph Wu
Assignee: Joseph Wu


{code}
class WindowsSSLSocketImpl : public SocketImpl
{
public:
  // This will be the entry point for Socket::create(SSL).
  static Try<std::shared_ptr<SocketImpl>> create(int_fd s);

  WindowsSSLSocketImpl(int_fd _s);
  ~WindowsSSLSocketImpl() override;

  // Overrides for the 'SocketImpl' interface below.

  // Unreachable.
  Future<Nothing> connect(const Address& address) override;

  // This will initialize SSL objects then call windows::connect()
  // and chain that onto the appropriate call to SSL_do_handshake.
  Future<Nothing> connect(
      const Address& address,
      const openssl::TLSClientConfig& config) override;

  // These will call SSL_read or SSL_write as appropriate.
  // As long as the SSL context is set up correctly, these will be
  // thin wrappers.  (More details after the code block.)
  Future<size_t> recv(char* data, size_t size) override;
  Future<size_t> send(const char* data, size_t size) override;
  Future<size_t> sendfile(int_fd fd, off_t offset, size_t size) override;

  // Nothing SSL here, just a plain old listener.
  Try<Nothing> listen(int backlog) override;

  // This will initialize SSL objects then call windows::accept()
  // and then perform handshaking.  Any downgrading will
  // happen here.  Since we control the event loop, we can
  // easily peek at the first few bytes to check SSL-ness.
  Future<std::shared_ptr<SocketImpl>> accept() override;

  SocketImpl::Kind kind() const override { return SocketImpl::Kind::SSL; }
};
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10009) Windows SSL: Implement glue code for the Windows event loop and OpenSSL's basic I/O abstraction

2019-10-09 Thread Joseph Wu (Jira)
Joseph Wu created MESOS-10009:
-

 Summary: Windows SSL: Implement glue code for the Windows event 
loop and OpenSSL's basic I/O abstraction
 Key: MESOS-10009
 URL: https://issues.apache.org/jira/browse/MESOS-10009
 Project: Mesos
  Issue Type: Task
Reporter: Joseph Wu
Assignee: Joseph Wu


In order for the Windows event loop to pass data to the OpenSSL library, we 
will need some glue code in the form of a "BIO":
https://www.openssl.org/docs/man1.1.1/man7/bio.html

This will basically mean wrapping the two {{windows::read}} and 
{{windows::write}} async I/O functions in the callbacks that OpenSSL expects, 
plus implementing a few other required callbacks.  This page describes the set 
of functions used to build up a new BIO type:
https://www.openssl.org/docs/man1.1.1/man3/BIO_meth_new.html
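
As a rough illustration of the shape of that glue (a sketch only; the callback
bodies are placeholders for where {{windows::read}}/{{windows::write}} would be
invoked, and the method name is made up):
{code}
#include <openssl/bio.h>

// Placeholder write callback: the real version would start an overlapped
// windows::write() and report how many bytes were accepted, or set the
// BIO retry flag while the operation is still pending.
static int overlappedWrite(BIO* bio, const char* data, int length)
{
  return length;
}

// Placeholder read callback: the real version would surface data already
// produced by an overlapped windows::read().
static int overlappedRead(BIO* bio, char* data, int length)
{
  return 0;
}

// OpenSSL queries the BIO through ctrl(); BIO_CTRL_FLUSH at minimum
// should report success.
static long overlappedCtrl(BIO* bio, int command, long, void*)
{
  return command == BIO_CTRL_FLUSH ? 1 : 0;
}

// Build the custom BIO_METHOD once and reuse it for every socket.
BIO_METHOD* createOverlappedBioMethod()
{
  BIO_METHOD* method = BIO_meth_new(
      BIO_get_new_index() | BIO_TYPE_SOURCE_SINK, "windows overlapped I/O");

  BIO_meth_set_write(method, overlappedWrite);
  BIO_meth_set_read(method, overlappedRead);
  BIO_meth_set_ctrl(method, overlappedCtrl);

  return method;
}
{code}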



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10003) Design doc for SSL on Windows

2019-10-09 Thread Joseph Wu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948029#comment-16948029
 ] 

Joseph Wu commented on MESOS-10003:
---

For a Windows event loop OpenSSL socket implementation, we will need to create 
a new subclass for {{SocketImpl}}.

{code}
class WindowsSSLSocketImpl : public SocketImpl
{
public:
  // This will be the entry point for Socket::create(SSL).
  static Try<std::shared_ptr<SocketImpl>> create(int_fd s);

  WindowsSSLSocketImpl(int_fd _s);
  ~WindowsSSLSocketImpl() override;

  // Overrides for the 'SocketImpl' interface below.

  // Unreachable.
  Future<Nothing> connect(const Address& address) override;

  // This will initialize SSL objects then call windows::connect()
  // and chain that onto the appropriate call to SSL_do_handshake.
  Future<Nothing> connect(
      const Address& address,
      const openssl::TLSClientConfig& config) override;

  // These will call SSL_read or SSL_write as appropriate.
  // As long as the SSL context is set up correctly, these will be
  // thin wrappers.  (More details after the code block.)
  Future<size_t> recv(char* data, size_t size) override;
  Future<size_t> send(const char* data, size_t size) override;
  Future<size_t> sendfile(int_fd fd, off_t offset, size_t size) override;

  // Nothing SSL here, just a plain old listener.
  Try<Nothing> listen(int backlog) override;

  // This will initialize SSL objects then call windows::accept()
  // and then perform handshaking.  Any downgrading will
  // happen here.  Since we control the event loop, we can
  // easily peek at the first few bytes to check SSL-ness.
  Future<std::shared_ptr<SocketImpl>> accept() override;

  SocketImpl::Kind kind() const override { return SocketImpl::Kind::SSL; }
};
{code}

To make the SSL context use the Windows event loop, we will need to 
replace {{BIO_new_socket}} with a custom BIO wrapping our event loop's I/O 
methods ({{windows::read}} and {{windows::write}}).  This is not complicated 
(just implementing some callbacks), and libevent has an example of this if 
needed.
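
Concretely, the substitution could look something like this (a sketch;
{{createOverlappedBioMethod()}} is the hypothetical factory sketched under
MESOS-10009, and the {{state}} pointer is assumed per-socket bookkeeping):
{code}
#include <openssl/bio.h>
#include <openssl/ssl.h>

BIO_METHOD* createOverlappedBioMethod(); // Sketched under MESOS-10009.

// Attach a custom event-loop-backed BIO to an SSL object, in place of
// the BIO_new_socket() call the libevent implementation relies on.
void attachOverlappedBio(SSL* ssl, void* state)
{
  BIO* bio = BIO_new(createOverlappedBioMethod());
  BIO_set_data(bio, state); // Reachable from the callbacks via BIO_get_data().

  SSL_set_bio(ssl, bio, bio); // The same BIO serves reads and writes.
}
{code}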

> Design doc for SSL on Windows
> -
>
> Key: MESOS-10003
> URL: https://issues.apache.org/jira/browse/MESOS-10003
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess
>Reporter: Greg Mann
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (MESOS-10003) Design doc for SSL on Windows

2019-10-09 Thread Joseph Wu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948020#comment-16948020
 ] 

Joseph Wu edited comment on MESOS-10003 at 10/9/19 9:33 PM:


This is less of a design doc, and more of a design blurb, because the task here 
is to use the OpenSSL library directly.

This blurb leans heavily on OpenSSL's Basic I/O (BIO) abstraction, so reading 
this overview first will help:
https://www.openssl.org/docs/man1.1.1/man7/bio.html

Our reference implementation is from libevent:
https://github.com/libevent/libevent/blob/master/bufferevent_openssl.c

We do not use all aspects of Libevent's SSL implementation.  We use the 
{{bufferevent_openssl_socket_new}} method, which is a thin wrapper around 
OpenSSL's {{BIO_new_socket}} and {{SSL_set_bio}} methods.
https://github.com/libevent/libevent/blob/master/bufferevent_openssl.c#L1441

{{BIO_new_socket}} takes a socket and transforms it into a source/sink BIO, 
while {{SSL_set_bio}} takes an SSL context and assigns the BIO to it, which 
allows use of methods like {{SSL_read}} and {{SSL_write}}.

Libevent also wraps a call to {{SSL_do_handshake}} when initializing a socket.

The role of libevent is to space out the calls to read/write based on the 
bufferevents we give it.
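
In plain OpenSSL calls, the pattern that libevent wraps boils down to roughly
the following (a sketch; error handling and the retry loop that libevent's
event callbacks provide are omitted):
{code}
#include <openssl/ssl.h>

// What bufferevent_openssl_socket_new() effectively does for a
// server-side socket: wrap the fd in a source/sink BIO, hand it to the
// SSL object, and start the handshake.
void attachAndHandshake(SSL* ssl, int fd)
{
  BIO* bio = BIO_new_socket(fd, BIO_NOCLOSE);
  SSL_set_bio(ssl, bio, bio);

  SSL_set_accept_state(ssl); // Use SSL_set_connect_state() on the client.

  // Libevent re-drives this from its event loop whenever SSL_get_error()
  // reports SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE.
  SSL_do_handshake(ssl);
}
{code}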


was (Author: kaysoky):
This is less of a design doc, and more of a design blurb, because the task here 
is to use the OpenSSL library directly.

Before proceeding, we will use OpenSSL's Basic I/O (BIO) abstraction a lot in 
this blurb, so reading this overview will help:
https://www.openssl.org/docs/man1.1.1/man7/bio.html

Our reference implementation is from libevent:
https://github.com/libevent/libevent/blob/master/bufferevent_openssl.c

We do not use all aspects of Libevent's SSL implementation.  We use the 
{{bufferevent_openssl_socket_new}} method, which is a thin wrapper around 
OpenSSL's {{BIO_new_socket}} and {{SSL_set_bio}} methods.
https://github.com/libevent/libevent/blob/master/bufferevent_openssl.c#L1441

{{BIO_new_socket}} takes a socket and transforms it into a source/sink BIO, 
while {{SSL_set_bio}} takes an SSL context and assigns the BIO to it, which 
allows use of methods like {{SSL_read}} and {{SSL_write}}.

The role of libevent is to space out calls to read/write based on the 
bufferevents we give libevent.

> Design doc for SSL on Windows
> -
>
> Key: MESOS-10003
> URL: https://issues.apache.org/jira/browse/MESOS-10003
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess
>Reporter: Greg Mann
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10003) Design doc for SSL on Windows

2019-10-09 Thread Joseph Wu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948020#comment-16948020
 ] 

Joseph Wu commented on MESOS-10003:
---

This is less of a design doc, and more of a design blurb, because the task here 
is to use the OpenSSL library directly.

This blurb leans heavily on OpenSSL's Basic I/O (BIO) abstraction, so reading 
this overview first will help:
https://www.openssl.org/docs/man1.1.1/man7/bio.html

Our reference implementation is from libevent:
https://github.com/libevent/libevent/blob/master/bufferevent_openssl.c

We do not use all aspects of Libevent's SSL implementation.  We use the 
{{bufferevent_openssl_socket_new}} method, which is a thin wrapper around 
OpenSSL's {{BIO_new_socket}} and {{SSL_set_bio}} methods.
https://github.com/libevent/libevent/blob/master/bufferevent_openssl.c#L1441

{{BIO_new_socket}} takes a socket and transforms it into a source/sink BIO, 
while {{SSL_set_bio}} takes an SSL context and assigns the BIO to it, which 
allows use of methods like {{SSL_read}} and {{SSL_write}}.

The role of libevent is to space out the calls to read/write based on the 
bufferevents we give it.

> Design doc for SSL on Windows
> -
>
> Key: MESOS-10003
> URL: https://issues.apache.org/jira/browse/MESOS-10003
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess
>Reporter: Greg Mann
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10004) Enable SSL on Windows

2019-10-08 Thread Joseph Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-10004:
-

Assignee: Joseph Wu

> Enable SSL on Windows
> -
>
> Key: MESOS-10004
> URL: https://issues.apache.org/jira/browse/MESOS-10004
> Project: Mesos
>  Issue Type: Epic
>Reporter: Greg Mann
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10003) Design doc for SSL on Windows

2019-10-08 Thread Joseph Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-10003:
-

Assignee: Joseph Wu

> Design doc for SSL on Windows
> -
>
> Key: MESOS-10003
> URL: https://issues.apache.org/jira/browse/MESOS-10003
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess
>Reporter: Greg Mann
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9971) 'dist' and 'distcheck' cmake targets are implemented as shell scripts, so fail on Windows/MSVC.

2019-09-26 Thread Joseph Wu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938705#comment-16938705
 ] 

Joseph Wu commented on MESOS-9971:
--

Oh, good catch.  We moved that file to {{support/setup-dev.bat}} as part of 
this review: https://reviews.apache.org/r/71299/

If you are purely building Mesos, and not developing it, then you can simply 
remove that step from your build process.

> 'dist' and 'distcheck' cmake targets are implemented as shell scripts, so 
> fail on Windows/MSVC.
> ---
>
> Key: MESOS-9971
> URL: https://issues.apache.org/jira/browse/MESOS-9971
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: master
> Environment: {color:#172b4d}VS 2017 + Windows Server 2016{color}
>Reporter: LinGao
>Assignee: Joseph Wu
>Priority: Trivial
>  Labels: foundations
> Fix For: 1.10.0
>
> Attachments: log_x64_build.log
>
>
> Mesos failed to build due to error MSB6006: "cmd.exe" exited with code 1 on 
> Windows using MSVC. It can first be reproduced at revision 
> {color:#24292e}e0f7e2d{color} on the master branch. Could you please 
> take a look at this issue? Thanks a lot!
> Reproduce steps:
> 1. git clone -c core.autocrlf=true [https://github.com/apache/mesos] 
> D:\mesos\src
>  2. Open a VS 2017 x64 command prompt as admin and browse to D:\mesos
>  3. cd src
>  4. .\bootstrap.bat
>  5. cd ..
>  6. mkdir build_x64 && pushd build_x64
>  7. cmake ..\src -G "Visual Studio 15 2017 Win64" 
> -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64
>  8. msbuild Mesos.sln /p:Configuration=Debug /p:Platform=x64 /maxcpucount:4 
> /t:Rebuild
>  
> ErrorMessage:
> 67>PrepareForBuild:
>  Creating directory "x64\Debug\dist\dist.tlog\".
>    InitializeBuildStatus:
>  Creating "x64\Debug\dist\dist.tlog\unsuccessfulbuild" because 
> "AlwaysCreate" was specified.
> 67>C:\Program Files (x86)\Microsoft Visual 
> Studio\2017\Enterprise\Common7\IDE\VC\VCTargets\Microsoft.CppCommon.targets(209,5):
>  error MSB6006: "cmd.exe" exited with code 1. 
> [D:\Mesos\build_x64\dist.vcxproj]
> 67>Done Building Project "D:\Mesos\build_x64\dist.vcxproj" (Rebuild 
> target(s)) -- FAILED.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-9977) Agent does not check for immutable files while removing persistent volumes (and possibly in other GC operations)

2019-09-21 Thread Joseph Wu (Jira)
Joseph Wu created MESOS-9977:


 Summary: Agent does not check for immutable files while removing 
persistent volumes (and possibly in other GC operations)
 Key: MESOS-9977
 URL: https://issues.apache.org/jira/browse/MESOS-9977
 Project: Mesos
  Issue Type: Bug
  Components: agent
Affects Versions: 1.9.0, 1.8.1, 1.7.2, 1.6.2
Reporter: Joseph Wu


We observed an exit/crash loop on an agent originating from deleting a 
persistent volume:
{code}
slave.cpp:4557] Deleting persistent volume '' at 
'/path/to/mesos/slave/volumes/roles/my-role/'
{code}

This persistent volume happened to have one (or more) of its files marked as 
{{immutable}}.

When the agent went to delete this persistent volume via {{os::rmdir(...)}}, 
it encountered these immutable file(s) and exited like:
{code}
slave.cpp:4423] EXIT with status 1: Failed to sync checkpointed resources: 
Failed to remove persistent volume '' at 
'/path/to/mesos/slave/volumes/roles/my-role/': Operation not permitted
{code}

The agent would then be unable to start up again, because during recovery, the 
agent would attempt to delete the same persistent volume and fail to do so.

Manually removing the immutable attribute from files within the persistent 
volume allows the agent to recover:
{code}
chattr -R -i /path/to/mesos/slave/volumes/roles/my-role/
{code}

Immutable attributes can easily be introduced by any task running on the 
agent.  As long as the task has sufficient permissions, it could simply call 
{{chattr +i ...}}.  This attribute could also affect sandbox GC, which also 
uses {{os::rmdir}} to clean up.  However, sandbox GC tends to warn rather than 
exit on failure.
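
If the agent (or its garbage collector) were to handle this itself rather than
requiring a manual {{chattr}}, one possible approach is to clear the flag
before removal, using the same ioctls that {{chattr(1)}} uses (a Linux-only
sketch, and an assumption rather than planned behavior):
{code}
#include <fcntl.h>
#include <unistd.h>

#include <linux/fs.h>
#include <sys/ioctl.h>

// Clear the immutable attribute on a single path, returning true if the
// attribute is no longer set afterwards.
bool clearImmutable(const char* path)
{
  int fd = ::open(path, O_RDONLY | O_NONBLOCK);
  if (fd < 0) {
    return false;
  }

  int flags = 0;
  bool ok = ::ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0;

  if (ok && (flags & FS_IMMUTABLE_FL)) {
    flags &= ~FS_IMMUTABLE_FL;
    ok = ::ioctl(fd, FS_IOC_SETFLAGS, &flags) == 0;
  }

  ::close(fd);
  return ok;
}
{code}
A directory-removal helper would have to walk the tree and apply this to every
entry (which is what {{chattr -R -i}} does), so whether that cost belongs in
{{os::rmdir}} or in a one-off recovery step is an open question.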



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-9971) 'dist' and 'distcheck' cmake targets are implemented as shell scripts, so fail on Windows/MSVC.

2019-09-18 Thread Joseph Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-9971:


  Sprint: Foundations: RI-18 55
Story Points: 1
Assignee: Joseph Wu
  Labels: foundations  (was: )
Priority: Trivial  (was: Major)

> 'dist' and 'distcheck' cmake targets are implemented as shell scripts, so 
> fail on Windows/MSVC.
> ---
>
> Key: MESOS-9971
> URL: https://issues.apache.org/jira/browse/MESOS-9971
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: master
> Environment: {color:#172b4d}VS 2017 + Windows Server 2016{color}
>Reporter: LinGao
>Assignee: Joseph Wu
>Priority: Trivial
>  Labels: foundations
> Attachments: log_x64_build.log
>
>
> Mesos failed to build due to error MSB6006: "cmd.exe" exited with code 1 on 
> Windows using MSVC. It can first be reproduced at revision 
> {color:#24292e}e0f7e2d{color} on the master branch. Could you please 
> take a look at this issue? Thanks a lot!
> Reproduce steps:
> 1. git clone -c core.autocrlf=true [https://github.com/apache/mesos] 
> D:\mesos\src
>  2. Open a VS 2017 x64 command prompt as admin and browse to D:\mesos
>  3. cd src
>  4. .\bootstrap.bat
>  5. cd ..
>  6. mkdir build_x64 && pushd build_x64
>  7. cmake ..\src -G "Visual Studio 15 2017 Win64" 
> -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64
>  8. msbuild Mesos.sln /p:Configuration=Debug /p:Platform=x64 /maxcpucount:4 
> /t:Rebuild
>  
> ErrorMessage:
> 67>PrepareForBuild:
>  Creating directory "x64\Debug\dist\dist.tlog\".
>    InitializeBuildStatus:
>  Creating "x64\Debug\dist\dist.tlog\unsuccessfulbuild" because 
> "AlwaysCreate" was specified.
> 67>C:\Program Files (x86)\Microsoft Visual 
> Studio\2017\Enterprise\Common7\IDE\VC\VCTargets\Microsoft.CppCommon.targets(209,5):
>  error MSB6006: "cmd.exe" exited with code 1. 
> [D:\Mesos\build_x64\dist.vcxproj]
> 67>Done Building Project "D:\Mesos\build_x64\dist.vcxproj" (Rebuild 
> target(s)) -- FAILED.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9971) 'dist' and 'distcheck' cmake targets are implemented as shell scripts, so fail on Windows/MSVC.

2019-09-18 Thread Joseph Wu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932768#comment-16932768
 ] 

Joseph Wu commented on MESOS-9971:
--

Disabling the targets: https://reviews.apache.org/r/71507/

> 'dist' and 'distcheck' cmake targets are implemented as shell scripts, so 
> fail on Windows/MSVC.
> ---
>
> Key: MESOS-9971
> URL: https://issues.apache.org/jira/browse/MESOS-9971
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: master
> Environment: {color:#172b4d}VS 2017 + Windows Server 2016{color}
>Reporter: LinGao
>Priority: Major
> Attachments: log_x64_build.log
>
>
> Mesos failed to build due to error MSB6006: "cmd.exe" exited with code 1 on 
> Windows using MSVC. It can first be reproduced at revision 
> {color:#24292e}e0f7e2d{color} on the master branch. Could you please 
> take a look at this issue? Thanks a lot!
> Reproduce steps:
> 1. git clone -c core.autocrlf=true [https://github.com/apache/mesos] 
> D:\mesos\src
>  2. Open a VS 2017 x64 command prompt as admin and browse to D:\mesos
>  3. cd src
>  4. .\bootstrap.bat
>  5. cd ..
>  6. mkdir build_x64 && pushd build_x64
>  7. cmake ..\src -G "Visual Studio 15 2017 Win64" 
> -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64
>  8. msbuild Mesos.sln /p:Configuration=Debug /p:Platform=x64 /maxcpucount:4 
> /t:Rebuild
>  
> ErrorMessage:
> 67>PrepareForBuild:
>  Creating directory "x64\Debug\dist\dist.tlog\".
>    InitializeBuildStatus:
>  Creating "x64\Debug\dist\dist.tlog\unsuccessfulbuild" because 
> "AlwaysCreate" was specified.
> 67>C:\Program Files (x86)\Microsoft Visual 
> Studio\2017\Enterprise\Common7\IDE\VC\VCTargets\Microsoft.CppCommon.targets(209,5):
>  error MSB6006: "cmd.exe" exited with code 1. 
> [D:\Mesos\build_x64\dist.vcxproj]
> 67>Done Building Project "D:\Mesos\build_x64\dist.vcxproj" (Rebuild 
> target(s)) -- FAILED.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9971) Mesos failed to build due to error MSB6006 on Windows with MSVC.

2019-09-18 Thread Joseph Wu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932759#comment-16932759
 ] 

Joseph Wu commented on MESOS-9971:
--

The {{dist}} and {{distcheck}} targets, which were recently added to mirror 
those targets from the autotools build, are currently implemented as {{.sh}} 
scripts, and are not expected to work on Windows.  

I'll consider removing those targets from the Windows build, if we deem the 
feature (making a clean source package) unnecessary for the Windows build.

> Mesos failed to build due to error MSB6006 on Windows with MSVC.
> 
>
> Key: MESOS-9971
> URL: https://issues.apache.org/jira/browse/MESOS-9971
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: master
> Environment: {color:#172b4d}VS 2017 + Windows Server 2016{color}
>Reporter: LinGao
>Priority: Major
> Attachments: log_x64_build.log
>
>
> Mesos failed to build due to error MSB6006: "cmd.exe" exited with code 1 on 
> Windows using MSVC. It can first be reproduced at revision 
> {color:#24292e}e0f7e2d{color} on the master branch. Could you please 
> take a look at this issue? Thanks a lot!
> Reproduce steps:
> 1. git clone -c core.autocrlf=true [https://github.com/apache/mesos] 
> D:\mesos\src
>  2. Open a VS 2017 x64 command prompt as admin and browse to D:\mesos
>  3. cd src
>  4. .\bootstrap.bat
>  5. cd ..
>  6. mkdir build_x64 && pushd build_x64
>  7. cmake ..\src -G "Visual Studio 15 2017 Win64" 
> -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64
>  8. msbuild Mesos.sln /p:Configuration=Debug /p:Platform=x64 /maxcpucount:4 
> /t:Rebuild
>  
> ErrorMessage:
> 67>PrepareForBuild:
>  Creating directory "x64\Debug\dist\dist.tlog\".
>    InitializeBuildStatus:
>  Creating "x64\Debug\dist\dist.tlog\unsuccessfulbuild" because 
> "AlwaysCreate" was specified.
> 67>C:\Program Files (x86)\Microsoft Visual 
> Studio\2017\Enterprise\Common7\IDE\VC\VCTargets\Microsoft.CppCommon.targets(209,5):
>  error MSB6006: "cmd.exe" exited with code 1. 
> [D:\Mesos\build_x64\dist.vcxproj]
> 67>Done Building Project "D:\Mesos\build_x64\dist.vcxproj" (Rebuild 
> target(s)) -- FAILED.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9892) Test various agent state transitions involving agent draining

2019-08-19 Thread Joseph Wu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910783#comment-16910783
 ] 

Joseph Wu commented on MESOS-9892:
--

The test cases cover (nothing running on the agent, 1 task running on the 
agent) x (normal drain, drain + mark gone, drain while the agent is 
disconnected, drain while the agent is unreachable).

Starting here: https://reviews.apache.org/r/71314/

> Test various agent state transitions involving agent draining
> -
>
> Key: MESOS-9892
> URL: https://issues.apache.org/jira/browse/MESOS-9892
> Project: Mesos
>  Issue Type: Task
>Reporter: Greg Mann
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations, mesosphere
>
> We should add tests which verify correct behavior in the various cases of 
> transitions between different agent states and the DRAINING or DRAINED states.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (MESOS-9934) Master does not handle returning unreachable agents as draining/deactivated

2019-08-12 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-9934:


 Summary: Master does not handle returning unreachable agents as 
draining/deactivated
 Key: MESOS-9934
 URL: https://issues.apache.org/jira/browse/MESOS-9934
 Project: Mesos
  Issue Type: Bug
  Components: master
Reporter: Joseph Wu
Assignee: Joseph Wu


The master has two code paths for handling agent reregistration messages, one 
culminating in {{Master::___reregisterSlave}} and the other in 
{{Master::__reregisterSlave}}. The two paths are not continuations of each 
other.  It looks like we missed the double-underscore case in the initial 
implementation.  This is the path that unreachable agents take, when/if they 
come back to the cluster.  The result is that when unreachable agents are 
marked for draining, they are not sent the appropriate message unless they 
are forced to reregister again (i.e. restarted manually).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (MESOS-9899) Using a symlink as the agent's work directory results in non-removal of persistent volume mounts.

2019-07-19 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-9899:


 Summary: Using a symlink as the agent's work directory results in 
non-removal of persistent volume mounts.
 Key: MESOS-9899
 URL: https://issues.apache.org/jira/browse/MESOS-9899
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.8.0, 1.7.0, 1.6.0
Reporter: Joseph Wu


The agent's directory layout places created persistent volumes under the 
agent's {{--work_dir}}:
{code}
//   root ('--work_dir' flag)
//   |-- volumes
//   |   |-- roles
//   |       |-- <role>
//   |           |-- <persistence_id> (persistent volume)
{code}

When these persistent volumes are used, they will (on Linux) generally be 
mounted underneath the sandbox directory (also located under {{--work_dir}}).  
Upon termination of use, persistent volumes are unmounted by reading the mount 
table, and checking if any mount targets are under the sandbox:
{code}
  // Reverse unmount order to handle nested mount points.
  foreach (const fs::MountInfoTable::Entry& entry,
   adaptor::reverse(table->entries)) {
// NOTE: All persistent volumes are mounted at targets under the
// container's work directory. We unmount all the persistent
// volumes before unmounting the sandbox/work directory mount.
if (strings::startsWith(entry.target, sandbox)) {
  LOG(INFO) << "Unmounting volume '" << entry.target
<< "' for container " << containerId;
{code}

---

However, when an agent's work directory is placed under a symlink, the same 
code above might not find any persistent volumes to remove.  This is because 
the mount table shows the real location on disk, while the sandbox path uses 
the symlinked location.

For example, suppose:
* The {{--work_dir}} is {{/var/run/mesos}}.
* {{/var/run/mesos}} is a symlink pointing to {{/tmp/link}}.

The agent will create sandboxes under paths like 
{{/var/run/mesos/slave/.../framework/.../...}}.  The mount table however, will 
show mount targets like {{/tmp/link/slave/.../framework/.../...}}.  Since the 
mount table target does not start with the sandbox path, the 
{{filesystem/linux}} isolator will not find any persistent volumes to clean up. 
 The agent's garbage collector will also fail here, because it tries to unmount 
any persistent volumes under the agent's work directory.
{code}
  foreach (const fs::MountInfoTable::Entry& entry,
   adaptor::reverse(mountTable->entries)) {
// Ignore mounts whose targets are not under `workDir`.
if (!strings::startsWith(
path::join(entry.target, ""),
path::join(_workDir, ""))) {
continue;
}
{code}
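
One possible direction for a fix (an assumption, not an actual patch) is to
resolve the sandbox/work directory through {{os::realpath}} before the prefix
comparison, so that both sides of the check refer to the real location:
{code}
  // Compare mount targets against the fully-resolved sandbox path, since
  // the mount table always reports real (symlink-free) locations.
  Result<std::string> realSandbox = os::realpath(sandbox);

  if (realSandbox.isSome()) {
    foreach (const fs::MountInfoTable::Entry& entry,
             adaptor::reverse(table->entries)) {
      if (strings::startsWith(entry.target, realSandbox.get())) {
        // Unmount the persistent volume, as in the existing code.
      }
    }
  }
{code}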



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (MESOS-9892) Test various agent state transitions involving agent draining

2019-07-17 Thread Joseph Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-9892:


Assignee: Joseph Wu

> Test various agent state transitions involving agent draining
> -
>
> Key: MESOS-9892
> URL: https://issues.apache.org/jira/browse/MESOS-9892
> Project: Mesos
>  Issue Type: Task
>Reporter: Greg Mann
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations, mesosphere
>
> We should add tests which verify correct behavior in the various cases of 
> transitions between different agent states and the DRAINING or DRAINED states.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (MESOS-9894) Mesos failed to build due to fatal error C1083 on Windows using MSVC.

2019-07-17 Thread Joseph Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-9894:


Assignee: Joseph Wu
  Sprint: Mesos Foundations: RI-16 51
Story Points: 1
  Labels: foundations mesosphere  (was: )

> Mesos failed to build due to fatal error C1083 on Windows using MSVC.
> -
>
> Key: MESOS-9894
> URL: https://issues.apache.org/jira/browse/MESOS-9894
> Project: Mesos
>  Issue Type: Bug
>  Components: build
> Environment: VS 2017 + Windows Server 2016
>Reporter: LinGao
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations, mesosphere
> Attachments: log_x64_build.log
>
>
> Mesos failed to build due to fatal error C1083: Cannot open include file: 
> 'slave/volume_gid_manager/state.pb.h': No such file or directory on Windows 
> using MSVC. It can first be reproduced at revision 6a026e3 on the master branch. 
> Could you please take a look at this issue? Thanks a lot!
> Reproduce steps:
> 1. git clone -c core.autocrlf=true https://github.com/apache/mesos 
> D:\mesos\src
> 2. Open a VS 2017 x64 command prompt as admin and browse to D:\mesos
> 3. cd src
> 4. .\bootstrap.bat
> 5. cd ..
> 6. mkdir build_x64 && pushd build_x64
> 7. cmake ..\src -G "Visual Studio 15 2017 Win64" 
> -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64
> 8. msbuild Mesos.sln /p:Configuration=Debug /p:Platform=x64 /maxcpucount:4 
> /t:Rebuild
>  
> ErrorMessage:
> D:\Mesos\src\include\mesos/docker/spec.hpp(29): fatal error C1083: Cannot 
> open include file: 'mesos/docker/spec.pb.h': No such file or directory
> D:\Mesos\src\src\slave/volume_gid_manager/state.hpp(21): fatal error C1083: 
> Cannot open include file: 'slave/volume_gid_manager/state.pb.h': No such file 
> or directory
> D:\Mesos\src\src\slave/volume_gid_manager/state.hpp(21): fatal error C1083: 
> Cannot open include file: 'slave/volume_gid_manager/state.pb.h': No such file 
> or directory
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (MESOS-9817) Add minimum master capability for draining and deactivation states

2019-06-04 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-9817:


 Summary: Add minimum master capability for draining and 
deactivation states
 Key: MESOS-9817
 URL: https://issues.apache.org/jira/browse/MESOS-9817
 Project: Mesos
  Issue Type: Task
  Components: master
Reporter: Joseph Wu


Since we are adding new fields to the registry to represent agent 
draining/deactivation, we cannot allow downgrades of masters while such 
features are in use.  

A new minimum capability should be added to the registry with the appropriate 
documentation:
https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/src/master/master.cpp#L1681-L1688
http://mesos.apache.org/documentation/latest/downgrades/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9816) Add draining state information to master event stream and state endpoints

2019-06-04 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-9816:


 Summary: Add draining state information to master event stream and 
state endpoints
 Key: MESOS-9816
 URL: https://issues.apache.org/jira/browse/MESOS-9816
 Project: Mesos
  Issue Type: Task
  Components: master
Reporter: Joseph Wu


The response for {{GET_STATE}} and {{GET_AGENTS}} should include the new fields 
indicating deactivation or draining states:
{code}
message Response {
  . . .

  message GetAgents {
message Agent {
  . . .

  optional bool deactivated = 12;
  optional DrainInfo drain_info = 13;

  . . .
}
  }
  . . .
}
{code}

Additionally, the master's event stream should get a new event whenever these 
states change:
{code}
message Event {
  . . .

  enum Type {
. . .

AGENT_UPDATED = 10;
  }

  message AgentUpdated {
optional bool deactivated = 1;
optional DrainInfo drain_info = 2;
  }

  . . .

  optional AgentUpdated agent_updated = 10;
}
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9814) Implement DrainAgent master/operator call with associated registry actions

2019-06-04 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-9814:


 Summary: Implement DrainAgent master/operator call with associated 
registry actions
 Key: MESOS-9814
 URL: https://issues.apache.org/jira/browse/MESOS-9814
 Project: Mesos
  Issue Type: Task
  Components: master
Reporter: Joseph Wu


We want to add several calls associated with agent draining:
{code}
message Call {
  enum Type {

. . .

DRAIN_AGENT = 37;
DEACTIVATE_AGENT = 38;
REACTIVATE_AGENT = 39;
  }

  . . .

  message DrainAgents {
message DrainConfig {
  required AgentID agent = 1;

  // The duration after which the agent should complete draining.
  // If tasks are still running after this time, they will
  // be forcefully terminated.
  optional Duration max_grace_period = 2;

  // Whether or not this agent will be removed permanently
  // from the cluster when draining is complete.
  optional bool destructive = 3 [default = false];
}

repeated DrainConfig drain_config = 1;
  }

  message DeactivateAgents {
repeated AgentID agents = 1;
  }

  message ReactivateAgents {
repeated AgentID agents = 1;
  }
}
{code}

Each field will be persisted in the registry:
{code}
message Registry {

  . . .

  message Slave {
. . .

optional DrainInfo drain_info = 2;
  }

  . . .

  message UnreachableSlave {
. . .

optional DrainInfo drain_info = 3;
  }
}
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9800) libarchive cannot extract tarfile due to UTF-8 encoding issues

2019-05-28 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16850170#comment-16850170
 ] 

Joseph Wu commented on MESOS-9800:
--

I believe this error appears when the locale of the program (the Mesos agent) 
is set to some non-UTF-8 locale, such as the default POSIX locale.  Adding the 
following line _might_ be enough to fix it.

{code}
diff --git a/3rdparty/stout/include/stout/archiver.hpp 
b/3rdparty/stout/include/stout/archiver.hpp
index 551e644a3..706ba5282 100644
--- a/3rdparty/stout/include/stout/archiver.hpp
+++ b/3rdparty/stout/include/stout/archiver.hpp
@@ -54,6 +54,11 @@ inline Try<Nothing> extract(
   archive_read_support_format_all(reader.get());
   archive_read_support_filter_all(reader.get());
 
+  // Prevent Libarchive from trying to convert filenames to the locale-default
+  // character encoding. This conversion sometimes fails, for example when
+  // reading UTF-8 characters in a standard "C" locale (POSIX default).
+  archive_read_set_options(reader.get(), "hdrcharset=BINARY");
+
  std::unique_ptr<struct archive, std::function<void(struct archive*)>> writer(
 archive_write_disk_new(),
 [](struct archive* p) {
{code}

Could you attach an example archive that fails to extract and the locale of the 
machine running your agent (i.e. the output of the {{locale}} command)?  That 
should give me a better idea of what case is failing.

> libarchive cannot extract tarfile due to UTF-8 encoding issues
> --
>
> Key: MESOS-9800
> URL: https://issues.apache.org/jira/browse/MESOS-9800
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.7.2
> Environment: Mesos 1.7.2 and Marathon 1.4.3 running on top of Ubuntu 
> 16.04.
>Reporter: Felipe Alfaro Solana
>Priority: Major
>
> Starting with Mesos 1.7, the following change has been introduced:
>  * [MESOS-8064] - Mesos now requires libarchive to programmatically decode 
> .zip, .tar, .gzip, and other common file compression schemes. Version 3.3.2 
> is bundled in Mesos.
> However, this version of libarchive which is used by the fetcher component in 
> Mesos has problems in dealing with archive files (.tar and .zip) which 
> contain UTF-8 characters. We run Marahton on top of Mesos, and one of our 
> Marathon application relies on a .tar file which contains symlinks whose 
> target contains certain UTF-8 characters (Turkish) or the symlink name itself 
> contains UTF-8 characters. Mesos fetcher is unable to extract the archive and 
> fails with the following error:
> {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]: E0528 
> 10:47:30.791250  6136 fetcher.cpp:613] EXIT with status 1: Failed to fetch 
> '/tmp/certificates.tar.gz': Failed to extract archive 
> '/var/mesos/slaves/10c35371-f690-4d40-8b9e-30ffd04405fb-S6/frameworks/ff2993eb-987f-47b0-b3af-fb8b49ab0470-/executors/test-nginx.fe01a0c0-8135-11e9-a160-02427a38aa03/runs/6a6e87e8-5eef-4e8e-8c00-3f081fa187b0/certificates.tar.gz'
>  to 
> '/var/mesos/slaves/10c35371-f690-4d40-8b9e-30ffd04405fb-S6/frameworks/ff2993eb-987f-47b0-b3af-fb8b49ab0470-/executors/test-nginx.fe01a0c0-8135-11e9-a160-02427a38aa03/runs/6a6e87e8-5eef-4e8e-8c00-3f081fa187b0':
>  Failed to read archive header: Linkname can't be converted from UTF-8 to 
> current locale.}}
> {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]:}}
> {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]: End 
> fetcher log for container 6a6e87e8-5eef-4e8e-8c00-3f081fa187b0}}
> {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]: E0528 
> 10:47:30.846695  4343 fetcher.cpp:571] Failed to run mesos-fetcher: Failed to 
> fetch all URIs for container '6a6e87e8-5eef-4e8e-8c00-3f081fa187b0': exited 
> with status 1}}
> The same Marathon application works fine with Mesos 1.6 which does not use 
> libarchive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9798) How to reduce compile time after had changed/improved source code?

2019-05-28 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849982#comment-16849982
 ] 

Joseph Wu commented on MESOS-9798:
--

This is an architectural problem with how large the codebase is, and how the 
code is structured.  The Mesos agent takes the longest to compile, mostly 
because 80% or so of the Mesos source files are compiled into the agent.  That 
includes {{src/docker/docker.hpp}}.

> How to reduce compile time after had changed/improved source code?
> --
>
> Key: MESOS-9798
> URL: https://issues.apache.org/jira/browse/MESOS-9798
> Project: Mesos
>  Issue Type: Improvement
>  Components: cmake
>Affects Versions: 1.8.0
> Environment: Linux firework-vm01 4.9.0-9-amd64 #1 SMP Debian 
> 4.9.168-1+deb9u2 (2019-05-13) x86_64 GNU/Linux
>Reporter: chatsiri
>Priority: Minor
>  Labels: newbie
>
> Hello all, 
>      I have changed some variables in the src/ directory, but the compiler 
> takes a long time to finish the build steps. How can I reduce compile time per 
> component or source directory? For example, with the simple steps below:
>  # I add a new member function to class Docker in docker.hpp. This class is 
> declared in a file in the docker directory.
>  # Compile the source again from the build directory. This directory is 
> created in the base source code directory alongside src/, bin/ and include/.
>  # From the build path:
>  ## $ cd build
>  ## $ ../configure --disable-python --disable-java --enable-debug 
> --enable-fast-install
>  ## $ make
>  ## $ sudo make install
> In step 3, the compiler takes a long time to compile the source code. How can 
> we reduce compile time for just the source directory that we changed?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9329) CMake build on Fedora 28 fails due to libevent error

2019-05-21 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845237#comment-16845237
 ] 

Joseph Wu commented on MESOS-9329:
--

Just to clarify, *older* versions of libevent do not have CMake build files.  
We could potentially bump libevent to 
[2.1.8-stable|https://github.com/libevent/libevent/releases/tag/release-2.1.8-stable]
 to get both autotools and CMake on the same version.

> CMake build on Fedora 28 fails due to libevent error
> 
>
> Key: MESOS-9329
> URL: https://issues.apache.org/jira/browse/MESOS-9329
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>
> Trying to build Mesos using cmake with the options 
> {noformat}
> cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_SSL=1 -DENABLE_LIBEVENT=1
> {noformat}
> fails due to the following:
> {noformat}
> [  1%] Building C object CMakeFiles/event_extra.dir/bufferevent_openssl.c.o
> /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:
>  In function ‘bio_bufferevent_new’:
> /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:112:3:
>  error: dereferencing pointer to incomplete type ‘BIO’ {aka ‘struct bio_st’}
>   b->init = 0;
>^~
> /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:
>  At top level:
> /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:234:1:
>  error: variable ‘methods_bufferevent’ has initializer but incomplete type
>  static BIO_METHOD methods_bufferevent = {
> [...]
> {noformat}
> Since the autotools build does not have issues when enabling libevent and 
> ssl, it seems most likely that the `libevent-2.1.5-beta` version used by 
> default in the cmake build is somehow connected to the error message.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9749) mesos agent logging hangs upon systemd-journald restart

2019-05-14 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839419#comment-16839419
 ] 

Joseph Wu commented on MESOS-9749:
--

The agent ends up in a bad state because the stdout/err pipe gets filled, and 
therefore starts to block threads.  This can lead to unpredictable results 
(since we aren't sure which threads are blocked by IO).

If the logs are not written directly to journald, then you won't need a restart 
of the agent.  It should remain functional during the time journald is down.

Of course, restarting the agent is still an option.

> mesos agent logging hangs upon systemd-journald restart
> ---
>
> Key: MESOS-9749
> URL: https://issues.apache.org/jira/browse/MESOS-9749
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.7.2
> Environment: Running on centos 7.4.1708, systemd  219 (probably 
> heavily patched by centos)
> mesos-agent command:
> {code}
> /usr/sbin/mesos-slave \
>  
> --attributes='canary:canary-false;maintenance_group:group-6;network:10g;platform:centos;platform_major_version:7;rack_name:22.05;type:base;version:v2018-q-1'
>  \
>  --cgroups_enable_cfs \
>  --cgroups_hierarchy='/sys/fs/cgroup' \
>  --cgroups_net_cls_primary_handle='0xC370' \
>  --container_logger='org_apache_mesos_LogrotateContainerLogger' \
>  --containerizers='mesos' \
>  --credential='file:///etc/mesos-chef/slave-credential' \
>  
> --default_container_info='\{"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"},\{"host_path":"var_tmp","container_path":"/var/tmp","mode":"RW"},\{"host_path":".","container_path":"/mnt/mesos/sandbox","mode":"RW"},\{"host_path":"/usr/share/mesos/geoip","container_path":"/mnt/mesos/geoip","mode":"RO"}]}'
>  \
>  --docker_registry='https://filer-docker-registry.prod.crto.in/' \
>  --docker_store_dir='/var/opt/mesos/store/docker' \
>  --enforce_container_disk_quota \
>  
> --executor_environment_variables='\{"PATH":"/bin:/usr/bin","CRITEO_DC":"par","CRITEO_ENV":"prod","CRITEO_GEOIP_PATH":"/mnt/mesos/geoip"}'
>  \
>  --executor_registration_timeout='5mins' \
>  --fetcher_cache_dir='/var/opt/mesos/cache' \
>  --fetcher_cache_size='2GB' \
>  --hooks='com_criteo_mesos_CommandHook' \
>  --image_providers='docker' \
>  --image_provisioner_backend='copy' \
>  
> --isolation='linux/capabilities,cgroups/cpu,cgroups/mem,cgroups/net_cls,namespaces/pid,filesystem/linux,docker/runtime,network/cni,disk/xfs,com_criteo_mesos_CommandIsolator'
>  \
>  --logging_level='INFO' \
>  
> --master='zk://mesos:xx...@mesos-master01-par.central.criteo.prod:2181,mesos-master02-par.central.criteo.prod:2181,mesos-master03-par.central.criteo.prod:2181/mesos'
>  \
>  --modules='file:///etc/mesos-chef/slave-modules.json' \
>  --port=5051 \
>  --recover='reconnect' \
>  --resources='file:///etc/mesos-chef/custom_resources.json' \
>  --strict \
>  --work_dir='/var/opt/mesos' \
>  --xfs_kill_containers \
>  --xfs_project_range='[5000-50]'
> {code}
>Reporter: Gregoire Seux
>Priority: Minor
>  Labels: foundations
>
> When the mesos agent is launched through systemd, a restart of the 
> systemd-journald service makes mesos agent logging hang (no more output). The 
> process itself seems to work fine (we can query state via http, for instance).
> A restart of mesos-agent corrects the issue.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9749) mesos agent logging hangs upon systemd-journald restart

2019-05-14 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839408#comment-16839408
 ] 

Joseph Wu commented on MESOS-9749:
--

The default behavior of Mesos's logging is to write to stdout/stderr. When 
launching via systemd, this means you are writing to journald. And if journald 
is restarted, the pipe between the agent and journald would be broken. These 
sorts of broken pipes usually terminate the agent, but it seems to be different 
in systemd's case.
 See also: [https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=771122]

There are a variety of ways to get around this, basically involving writing 
logs to some other location:

---
 
h2. Built-in solutions

Mesos lets you write stdout/stderr to disk instead.  If you specify the 
{{--log_dir}} flag, Mesos will leverage glog's log writing behavior, which has 
some form of log rotation built in.  But unfortunately, this does not seem to 
bound the size of logs on disk, so you'd end up writing a script or such to 
clean up logs.

Besides that, you may modify your service file to write to something besides 
journald, such as syslog, or a file.
https://www.freedesktop.org/software/systemd/man/systemd.exec.html#Logging%20and%20Standard%20Input/Output

h2. Other solutions

By the looks of your agent configuration, you are not averse to deploying 
modules ({{--modules='file:///etc/mesos-chef/slave-modules.json'}}).  In this 
case, you have some other options.

DC/OS uses a {{LogSink}} module (a Mesos anonymous module implementing a glog 
sink) to pipe logs to files, which are then rotated by a separate timer.
https://github.com/dcos/dcos-mesos-modules/tree/master/logsink

If the goal is to get logs into journald, across journald restarts, this is 
also possible with a {{LogSink}}.  This would entail using the journald C API, 
like {{sd_journal_send}}.  I believe this is capable of reconnecting after 
journald restarts.
https://www.freedesktop.org/software/systemd/man/sd_journal_print.html
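
The core of such a sink is small; here is a minimal sketch of the journald C API mentioned above (not the DC/OS module itself; link with {{-lsystemd}}):
{code}
#include <syslog.h>              // For LOG_INFO.

#include <systemd/sd-journal.h>

// Write one log line to journald. sd_journal_send() (re)opens the journal
// socket as needed, which is why it keeps working across journald restarts.
void logToJournald(const char* line)
{
  sd_journal_send(
      "MESSAGE=%s", line,
      "PRIORITY=%i", LOG_INFO,
      "SYSLOG_IDENTIFIER=%s", "mesos-agent",
      NULL);
}
{code}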

> mesos agent logging hangs upon systemd-journald restart
> ---
>
> Key: MESOS-9749
> URL: https://issues.apache.org/jira/browse/MESOS-9749
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.7.2
> Environment: Running on centos 7.4.1708, systemd  219 (probably 
> heavily patched by centos)
> mesos-agent command:
> {code}
> /usr/sbin/mesos-slave \
>  
> --attributes='canary:canary-false;maintenance_group:group-6;network:10g;platform:centos;platform_major_version:7;rack_name:22.05;type:base;version:v2018-q-1'
>  \
>  --cgroups_enable_cfs \
>  --cgroups_hierarchy='/sys/fs/cgroup' \
>  --cgroups_net_cls_primary_handle='0xC370' \
>  --container_logger='org_apache_mesos_LogrotateContainerLogger' \
>  --containerizers='mesos' \
>  --credential='file:///etc/mesos-chef/slave-credential' \
>  
> --default_container_info='\{"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"},\{"host_path":"var_tmp","container_path":"/var/tmp","mode":"RW"},\{"host_path":".","container_path":"/mnt/mesos/sandbox","mode":"RW"},\{"host_path":"/usr/share/mesos/geoip","container_path":"/mnt/mesos/geoip","mode":"RO"}]}'
>  \
>  --docker_registry='https://filer-docker-registry.prod.crto.in/' \
>  --docker_store_dir='/var/opt/mesos/store/docker' \
>  --enforce_container_disk_quota \
>  
> --executor_environment_variables='\{"PATH":"/bin:/usr/bin","CRITEO_DC":"par","CRITEO_ENV":"prod","CRITEO_GEOIP_PATH":"/mnt/mesos/geoip"}'
>  \
>  --executor_registration_timeout='5mins' \
>  --fetcher_cache_dir='/var/opt/mesos/cache' \
>  --fetcher_cache_size='2GB' \
>  --hooks='com_criteo_mesos_CommandHook' \
>  --image_providers='docker' \
>  --image_provisioner_backend='copy' \
>  
> --isolation='linux/capabilities,cgroups/cpu,cgroups/mem,cgroups/net_cls,namespaces/pid,filesystem/linux,docker/runtime,network/cni,disk/xfs,com_criteo_mesos_CommandIsolator'
>  \
>  --logging_level='INFO' \
>  
> --master='zk://mesos:xx...@mesos-master01-par.central.criteo.prod:2181,mesos-master02-par.central.criteo.prod:2181,mesos-master03-par.central.criteo.prod:2181/mesos'
>  \
>  --modules='file:///etc/mesos-chef/slave-modules.json' \
>  --port=5051 \
>  --recover='reconnect' \
>  --resources='file:///etc/mesos-chef/custom_resources.json' \
>  --strict \
>  --work_dir='/var/opt/mesos' \
>  --xfs_kill_containers \
>  --xfs_project_range='[5000-50]'
> {code}
>Reporter: Gregoire Seux
>Priority: Minor
>  Labels: foundations
>
> When the mesos agent is launched through systemd, a restart of the 
> systemd-journald service makes mesos agent logging hang (no more output). The 
> process itself seems to work fine (we can query state via http, for instance).
> A restart of mesos-agent corrects the issue.
>  
>  



--
This message was sent by Atlassian JIRA

[jira] [Commented] (MESOS-9750) Agent V1 GET_STATE response may report a complete executor's tasks as non-terminal after a graceful agent shutdown

2019-05-14 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839373#comment-16839373
 ] 

Joseph Wu commented on MESOS-9750:
--

Found one more code path where the agent's {{GET_STATE}} will return extraneous 
"launched_tasks".

This happens when a Framework or Master {{TEARDOWN}} call is used and the 
executor does not send a terminal status update in time.  This one does not 
require an agent restart/shutdown.
Also, this code path will result in an executor's checkpointed state looking 
identical to the agent shutdown case.  If the agent is restarted, the code in 
the above patch will be run to put the agent back into a consistent state.

Fix and test here: https://reviews.apache.org/r/70641/

> Agent V1 GET_STATE response may report a complete executor's tasks as 
> non-terminal after a graceful agent shutdown
> --
>
> Key: MESOS-9750
> URL: https://issues.apache.org/jira/browse/MESOS-9750
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, executor
>Affects Versions: 1.6.0, 1.7.0, 1.8.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations
>
> When the following steps occur:
> 1) A graceful shutdown is initiated on the agent (i.e. SIGUSR1 or 
> /master/machine/down).
> 2) The executor is sent a kill, and the agent counts down on 
> {{executor_shutdown_grace_period}}.
> 3) The executor exits, before all terminal status updates reach the agent. 
> This is more likely if {{executor_shutdown_grace_period}} passes.
> This results in a completed executor, with non-terminal tasks (according to 
> status updates).
> When the agent starts back up, the completed executor will be recovered and 
> shows up correctly  as a completed executor in {{/state}}.  However, if you 
> fetch the V1 {{GET_STATE}} result, there will be an entry in 
> {{launched_tasks}} even though nothing is running.
> {code}
> get_tasks {
>   launched_tasks {
> name: "test-task"
> task_id {
>   value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
> }
> framework_id {
>   value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-"
> }
> executor_id {
>   value: "default"
> }
> agent_id {
>   value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
> }
> state: TASK_RUNNING
> resources { ... }
> resources { ... }
> resources { ... }
> resources { ... }
> statuses {
>   task_id {
> value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
>   }
>   state: TASK_RUNNING
>   agent_id {
> value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
>   }
>   timestamp: 1556674758.2175469
>   executor_id {
> value: "default"
>   }
>   source: SOURCE_EXECUTOR
>   uuid: "xPmn\234\236F&\235\\d\364\326\323\222\224"
>   container_status { ... }
> }
>   }
> }
> get_executors {
>   completed_executors {
> executor_info {
>   executor_id {
> value: "default"
>   }
>   command {
> value: ""
>   }
>   framework_id {
> value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-"
>   }
> }
>   }
> }
> get_frameworks {
>   completed_frameworks {
> framework_info {
>   user: "user"
>   name: "default"
>   id {
> value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-"
>   }
>   checkpoint: true
>   hostname: "localhost"
>   principal: "test-principal"
>   capabilities {
> type: MULTI_ROLE
>   }
>   capabilities {
> type: RESERVATION_REFINEMENT
>   }
>   roles: "*"
> }
>   }
> }
> {code}
> This happens because we combine executors and completed executors when 
> constructing the response.  The terminal task(s) with non-terminal updates 
> appear under completed executors.
> https://github.com/apache/mesos/blob/89c3dd95a421e14044bc91ceb1998ff4ae3883b4/src/slave/http.cpp#L1734-L1756



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9751) Build mesos example not found

2019-05-01 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831178#comment-16831178
 ] 

Joseph Wu commented on MESOS-9751:
--

The important mesos binaries will be generated under {{src/mesos-*}} inside the 
build directory.  The {{src/examples/}} folder only contains some shared 
libraries used by the example frameworks.

> Build mesos example not found
> -
>
> Key: MESOS-9751
> URL: https://issues.apache.org/jira/browse/MESOS-9751
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.6.1
>Reporter: darion yaphet
>Priority: Major
>
> I tried to build Mesos from source code using make. I think it should build 
> a binary under src/examples, but I can't find it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9750) Agent V1 GET_STATE response may report a complete executor's tasks as non-terminal after a graceful agent shutdown

2019-04-30 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830821#comment-16830821
 ] 

Joseph Wu commented on MESOS-9750:
--

Preliminary fix and test here: https://reviews.apache.org/r/70577/

> Agent V1 GET_STATE response may report a complete executor's tasks as 
> non-terminal after a graceful agent shutdown
> --
>
> Key: MESOS-9750
> URL: https://issues.apache.org/jira/browse/MESOS-9750
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, executor
>Affects Versions: 1.6.0, 1.7.0, 1.8.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations
>
> When the following steps occur:
> 1) A graceful shutdown is initiated on the agent (i.e. SIGUSR1 or 
> /master/machine/down).
> 2) The executor is sent a kill, and the agent counts down on 
> {{executor_shutdown_grace_period}}.
> 3) The executor exits, before all terminal status updates reach the agent. 
> This is more likely if {{executor_shutdown_grace_period}} passes.
> This results in a completed executor, with non-terminal tasks (according to 
> status updates).
> When the agent starts back up, the completed executor will be recovered and 
> shows up correctly  as a completed executor in {{/state}}.  However, if you 
> fetch the V1 {{GET_STATE}} result, there will be an entry in 
> {{launched_tasks}} even though nothing is running.
> {code}
> get_tasks {
>   launched_tasks {
> name: "test-task"
> task_id {
>   value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
> }
> framework_id {
>   value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-"
> }
> executor_id {
>   value: "default"
> }
> agent_id {
>   value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
> }
> state: TASK_RUNNING
> resources { ... }
> resources { ... }
> resources { ... }
> resources { ... }
> statuses {
>   task_id {
> value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
>   }
>   state: TASK_RUNNING
>   agent_id {
> value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
>   }
>   timestamp: 1556674758.2175469
>   executor_id {
> value: "default"
>   }
>   source: SOURCE_EXECUTOR
>   uuid: "xPmn\234\236F&\235\\d\364\326\323\222\224"
>   container_status { ... }
> }
>   }
> }
> get_executors {
>   completed_executors {
> executor_info {
>   executor_id {
> value: "default"
>   }
>   command {
> value: ""
>   }
>   framework_id {
> value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-"
>   }
> }
>   }
> }
> get_frameworks {
>   completed_frameworks {
> framework_info {
>   user: "user"
>   name: "default"
>   id {
> value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-"
>   }
>   checkpoint: true
>   hostname: "localhost"
>   principal: "test-principal"
>   capabilities {
> type: MULTI_ROLE
>   }
>   capabilities {
> type: RESERVATION_REFINEMENT
>   }
>   roles: "*"
> }
>   }
> }
> {code}
> This happens because we combine executors and completed executors when 
> constructing the response.  The terminal task(s) with non-terminal updates 
> appear under completed executors.
> https://github.com/apache/mesos/blob/89c3dd95a421e14044bc91ceb1998ff4ae3883b4/src/slave/http.cpp#L1734-L1756



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9750) Agent V1 GET_STATE response may report a complete executor's tasks as non-terminal after a graceful agent shutdown

2019-04-30 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-9750:


 Summary: Agent V1 GET_STATE response may report a complete 
executor's tasks as non-terminal after a graceful agent shutdown
 Key: MESOS-9750
 URL: https://issues.apache.org/jira/browse/MESOS-9750
 Project: Mesos
  Issue Type: Bug
  Components: agent, executor
Affects Versions: 1.7.0, 1.6.0, 1.8.0
Reporter: Joseph Wu
Assignee: Joseph Wu


When the following steps occur:
1) A graceful shutdown is initiated on the agent (i.e. SIGUSR1 or 
/master/machine/down).
2) The executor is sent a kill, and the agent counts down on 
{{executor_shutdown_grace_period}}.
3) The executor exits, before all terminal status updates reach the agent. This 
is more likely if {{executor_shutdown_grace_period}} passes.

This results in a completed executor, with non-terminal tasks (according to 
status updates).

When the agent starts back up, the completed executor will be recovered and 
shows up correctly  as a completed executor in {{/state}}.  However, if you 
fetch the V1 {{GET_STATE}} result, there will be an entry in {{launched_tasks}} 
even though nothing is running.
{code}
get_tasks {
  launched_tasks {
name: "test-task"
task_id {
  value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
}
framework_id {
  value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-"
}
executor_id {
  value: "default"
}
agent_id {
  value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
}
state: TASK_RUNNING
resources { ... }
resources { ... }
resources { ... }
resources { ... }
statuses {
  task_id {
value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
  }
  state: TASK_RUNNING
  agent_id {
value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
  }
  timestamp: 1556674758.2175469
  executor_id {
value: "default"
  }
  source: SOURCE_EXECUTOR
  uuid: "xPmn\234\236F&\235\\d\364\326\323\222\224"
  container_status { ... }
}
  }
}
get_executors {
  completed_executors {
executor_info {
  executor_id {
value: "default"
  }
  command {
value: ""
  }
  framework_id {
value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-"
  }
}
  }
}
get_frameworks {
  completed_frameworks {
framework_info {
  user: "user"
  name: "default"
  id {
value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-"
  }
  checkpoint: true
  hostname: "localhost"
  principal: "test-principal"
  capabilities {
type: MULTI_ROLE
  }
  capabilities {
type: RESERVATION_REFINEMENT
  }
  roles: "*"
}
  }
}
{code}

This happens because we combine executors and completed executors when 
constructing the response.  The terminal task(s) with non-terminal updates 
appear under completed executors.
https://github.com/apache/mesos/blob/89c3dd95a421e14044bc91ceb1998ff4ae3883b4/src/slave/http.cpp#L1734-L1756



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9740) Invalid protobuf unions in ExecutorInfo::ContainerInfo will prevent agents from reregistering with 1.8+ masters

2019-04-24 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825261#comment-16825261
 ] 

Joseph Wu commented on MESOS-9740:
--

Yes.  We expect the upgrade to work for most people.  However, our test cluster 
had a relatively wide variety of tasks, and just a single bad framework 
launching one or more tasks on each agent could cripple the upgrade.

I should clarify that this affects 1.8.x **masters**.  A 1.7.x agent _might_ 
have trouble registering with a 1.8.x master due to this bug.
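
For anyone wondering what "invalid protobuf union" means here, the check is conceptually just the following (a simplified sketch, not the actual validation code in the master):
{code}
#include <string>

#include <mesos/mesos.pb.h>

// Returns a non-empty error message if the union is inconsistent: the
// `type` field must match the union field that is actually set.
std::string validateUnion(const mesos::ContainerInfo& container)
{
  if (container.type() == mesos::ContainerInfo::MESOS &&
      container.has_docker()) {
    return "ContainerInfo with `Type == MESOS` should not have `docker` set";
  }

  if (container.type() == mesos::ContainerInfo::DOCKER &&
      container.has_mesos()) {
    return "ContainerInfo with `Type == DOCKER` should not have `mesos` set";
  }

  return "";
}
{code}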

> Invalid protobuf unions in ExecutorInfo::ContainerInfo will prevent agents 
> from reregistering with 1.8+ masters
> ---
>
> Key: MESOS-9740
> URL: https://issues.apache.org/jira/browse/MESOS-9740
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.8.0
>Reporter: Joseph Wu
>Assignee: Benno Evers
>Priority: Blocker
>  Labels: foundations, mesosphere
>
> As part of MESOS-6874, the master now validates protobuf unions passed as 
> part of an {{ExecutorInfo::ContainerInfo}}.  This prevents a task from 
> specifying, for example, a {{ContainerInfo::MESOS}}, but filling out the 
> {{docker}} field (which is then ignored by the agent).
> However, if a task was already launched with an invalid protobuf union, the 
> same validation will happen when the agent tries to reregister with the 
> master.  In this case, if the master is upgraded to validate protobuf unions, 
> the agent reregistration will be rejected.
> {code}
> master.cpp:7201] Dropping re-registration of agent at 
> slave(1)@172.31.47.126:5051 because it sent an invalid re-registration: 
> Protobuf union `mesos.ContainerInfo` with `Type == MESOS` should not have the 
> field `docker` set.
> {code}
> This bug was found when upgrading a 1.7.x test cluster to 1.8.0.  When 
> MESOS-6874 was committed, I had assumed the invalid protobufs would be rare.  
> However, on the test cluster, 13/17 agents had at least one invalid 
> ContainerInfo when reregistering.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9740) Invalid protobuf unions in ExecutorInfo::ContainerInfo will prevent agents from reregistering with 1.8+ masters

2019-04-23 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-9740:


 Summary: Invalid protobuf unions in ExecutorInfo::ContainerInfo 
will prevent agents from reregistering with 1.8+ masters
 Key: MESOS-9740
 URL: https://issues.apache.org/jira/browse/MESOS-9740
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.8.0
Reporter: Joseph Wu
Assignee: Benno Evers


As part of MESOS-6874, the master now validates protobuf unions passed as part 
of an {{ExecutorInfo::ContainerInfo}}.  This prevents a task from specifying, 
for example, a {{ContainerInfo::MESOS}}, but filling out the {{docker}} field 
(which is then ignored by the agent).

However, if a task was already launched with an invalid protobuf union, the 
same validation will happen when the agent tries to reregister with the master. 
 In this case, if the master is upgraded to validate protobuf unions, the agent 
reregistration will be rejected.

{code}
master.cpp:7201] Dropping re-registration of agent at 
slave(1)@172.31.47.126:5051 because it sent an invalid re-registration: 
Protobuf union `mesos.ContainerInfo` with `Type == MESOS` should not have the 
field `docker` set.
{code}

This bug was found when upgrading a 1.7.x test cluster to 1.8.0.  When 
MESOS-6874 was committed, I had assumed the invalid protobufs would be rare.  
However, on the test cluster, 13/17 agents had at least one invalid 
ContainerInfo when reregistering.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-5828) Modularize Network in replicated_log

2019-04-22 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-5828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823434#comment-16823434
 ] 

Joseph Wu commented on MESOS-5828:
--

Progress on this has been paused for a while (although the bulk of the patches 
are still usable).

In the meantime, you can try using zetcd, which basically exposes a ZK API for 
etcd:
https://github.com/etcd-io/zetcd

See this thread too: 
https://issues.apache.org/jira/browse/MESOS-1806?focusedCommentId=15895593=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15895593

> Modularize Network in replicated_log
> 
>
> Key: MESOS-5828
> URL: https://issues.apache.org/jira/browse/MESOS-5828
> Project: Mesos
>  Issue Type: Bug
>  Components: replicated log
>Reporter: Jay Guo
>Assignee: Jay Guo
>Priority: Major
>
> Currently replicated_log relies on Zookeeper for coordinator election. This 
> is done through network abstraction _ZookeeperNetwork_. We need to modularize 
> this part in order to enable replicated_log when using Master 
> contender/detector modules.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9726) No running tasks in marathon after restart non-leader mesos-master node

2019-04-22 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823423#comment-16823423
 ] 

Joseph Wu commented on MESOS-9726:
--

Logs of all three mesos masters would be helpful.

Also some quick questions to get us on the same page:
* What method did you use to determine the active Mesos leader?  (Marathon 
leader is not the Mesos leader.)
* When you "restart" anything, is this referring to stopping/starting a 
service?  Or rebooting an entire node?

> No running tasks in marathon after restart non-leader mesos-master node
> ---
>
> Key: MESOS-9726
> URL: https://issues.apache.org/jira/browse/MESOS-9726
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.1
>Reporter: Alexandr
>Priority: Minor
>  Labels: beginner
>
> Good day!
>  I have a problem with my Mesos cluster:
>  I have 3 mesos-master nodes (Mesos version 1.4.1) with a Marathon cluster v 
> 1.5.11 (let's call them Mas1, Mas2, Mas3) and 3 mesos-slave nodes (let's call 
> them Slv1, Slv2, Slv3) with apps running on them. I also have a Zookeeper 
> cluster on nodes Mas1, Mas2, Mas3.
>  The Mesos leader is master node "Mas1".
>  After I restarted master node "Mas3", it rejoined the Mesos cluster and 
> everything looked fine, but a moment later I opened Marathon and all running 
> tasks from my mesos-slave nodes became "unknown" with no instances running. 
>  So I checked:
>  1. My mesos agents - everything was ok, 3 agents running. 
>  2. That all services are running and all clusters (Mesos\Marathon\Zookeeper) 
> are fine.
>  3. Decided to restart all mesos-slave services on the slave nodes - on 
> slave node Slv3, 1 of 3 instances launched for each application; then I 
> restarted all marathon services. After that, all tasks switched to status 
> "Waiting"\"Delayed".
>  4. Checked mesos-master and slave logs; no errors or information about any 
> problems on the cluster - only information about killing and launching new 
> tasks on the slave node.
>  5. Decided to stop and start the mesos-master service to force re-election 
> of a Mesos leader. 
>  After that, the leader became master node "Mas2" and all tasks in Marathon 
> started running instances like normal. 
> Logs will be uploaded later. I wonder how this could happen.
> {code:java}
> Apr 10 18:32:18 Mas1 docker[*]: I0410 15:32:18.34240612 http.cpp:**] 
> HTTP GET for /master/state from :*** with User-Agent='Go-http-client/1.1' 
> Apr 10 18:32:18 Mas1 docker[*]: I0410 15:32:18.391480 9 
> http.cpp:1185] HTTP GET for /master/state from *:40686 with 
> User-Agent='Go-http-client/1.1' Apr 10 18:32:18 Mas1 docker[27956]: E0410 
> 15:32:18.56788014 process.cpp:2577] Failed to shutdown socket with fd 44, 
> address :5050: Transport endpoint is not connected Apr 10 18:32:18 Mas1 
> docker[27956]: E0410 15:32:18.59690114 process.cpp:2577] Failed to 
> shutdown socket with fd 46, address :5050: Transport endpoint is not 
> connected
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9727) Heartbeat calls from executor to agent are reported as errors

2019-04-10 Thread Joseph Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-9727:


Assignee: Joseph Wu

> Heartbeat calls from executor to agent are reported as errors
> -
>
> Key: MESOS-9727
> URL: https://issues.apache.org/jira/browse/MESOS-9727
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, executor
>Affects Versions: 1.8.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>Priority: Minor
>  Labels: foundations
>
> These HEARTBEAT calls and events were added in MESOS-7564. 
> HEARTBEAT calls are generated by the executor library, which does not have 
> access to the executor's Framework/Executor IDs.  The library therefore uses 
> some dummy values instead, because the required fields are not actually 
> needed for HEARTBEAT calls.  When the agent receives these dummy values, it returns a 
> 400 Bad Request.  It should return 202 Accepted instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9727) Heartbeat calls from executor to agent are reported as errors

2019-04-10 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-9727:


 Summary: Heartbeat calls from executor to agent are reported as 
errors
 Key: MESOS-9727
 URL: https://issues.apache.org/jira/browse/MESOS-9727
 Project: Mesos
  Issue Type: Bug
  Components: agent, executor
Affects Versions: 1.8.0
Reporter: Joseph Wu


These HEARTBEAT calls and events were added in MESOS-7564. 

HEARTBEAT calls are generated by the executor library, which does not have 
access to the executor's Framework/Executor IDs.  The library therefore uses 
some dummy values instead, because the required fields are not actually needed 
for HEARTBEAT calls.  When the agent receives these dummy values, it returns a 400 
Bad Request.  It should return 202 Accepted instead.
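
To illustrate the shape of the problem, a heartbeat essentially has to look like the sketch below, because {{framework_id}} and {{executor_id}} are required fields on {{executor::Call}} (this is a sketch only; the real library code differs):
{code}
#include <mesos/v1/executor/executor.hpp>

// Build a HEARTBEAT call. The IDs carry no meaning for a heartbeat, so the
// library has to fill in placeholder values; the agent should accept (202)
// rather than reject (400) such calls.
mesos::v1::executor::Call makeHeartbeat()
{
  mesos::v1::executor::Call call;
  call.set_type(mesos::v1::executor::Call::HEARTBEAT);

  // Placeholder values for the required fields.
  call.mutable_framework_id()->set_value("");
  call.mutable_executor_id()->set_value("");

  return call;
}
{code}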



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9718) Compile failures with char8_t by MSVC under /std:c++latest mode

2019-04-10 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814600#comment-16814600
 ] 

Joseph Wu edited comment on MESOS-9718 at 4/10/19 4:10 PM:
---

Looks like this error occurs because in C++20, a string literal like 
{{u8"..."}} translates into a {{const char8_t[N]}}, whereas in earlier C++ 
versions, the same expression gives a {{const char[N]}} type. We can implicitly 
convert to {{std::string}} from {{const char[N]}}, but not from {{const 
char8_t[N]}}, which should be held by a {{std::u8string}}.

I'm not sure if there is a quick fix for this, since we're still on C++14 or so.

Here's some further reading:
 [http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r3.html]


was (Author: kaysoky):
Looks like this error occurs because in C++20, a string literal like 
{{u8"..."}} translates into a {{const char8_t[N]}}, whereas in earlier C++ 
versions, the same expression gives a {{const char[N]}} type.  We can 
implicitly convert to {{std::string}} from {{const char[N]}}, but not from 
{{const char8_t[N]}}, which should be held by a {{std::u8string}}.

I'm not sure if there is a quick fix for this, since we're still on C++14 or so.

Here's some further reading:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r3.html

> Compile failures with char8_t by MSVC under /std:c++latest mode
> ---
>
> Key: MESOS-9718
> URL: https://issues.apache.org/jira/browse/MESOS-9718
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: QuellaZhang
>Priority: Major
>  Labels: windows
>
> Hi All,
> We've stumbled across some build failures in Mesos after implementing support 
> for char8_t under /std:c++latest in the development version of Visual C++. 
> Could you help look at this? Thanks in advance! Note that this issue is only 
> found when compiling with an unreleased VC toolset; the next release of MSVC 
> will have this behavior.
> *Repro steps:*
>  git clone -c core.autocrlf=true [https://github.com/apache/mesos] 
> D:\mesos\src
>  open a VS 2017 x64 command prompt as admin and browse to D:\mesos
>  set _CL_=/std:c++latest
>  cd src
>  .\bootstrap.bat
>  cd ..
>  mkdir build_x64 && pushd build_x64
>  cmake ..\src -G "Visual Studio 15 2017 Win64" 
> -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64
> *Failures:*
>  base64_tests.i
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2664: 
> 'std::string base64::encode_url_safe(const std::string &,bool)': cannot 
> convert argument 1 from 'const char8_t [12]' to 'const std::string &'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: Reason: cannot 
> convert from 'const char8_t [12]' to 'const std::string'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: No constructor 
> could take the source type, or constructor overload resolution was ambiguous
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2660: 
> 'testing::internal::EqHelper::Compare': function does not take 3 
> arguments
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430):
>  note: see declaration of 'testing::internal::EqHelper::Compare'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2512: 
> 'testing::AssertionResult': no appropriate default constructor available
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256):
>  note: see declaration of 'testing::AssertionResult'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2664: 
> 'std::string base64::encode_url_safe(const std::string &,bool)': cannot 
> convert argument 1 from 'const char8_t [12]' to 'const std::string &'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: Reason: cannot 
> convert from 'const char8_t [12]' to 'const std::string'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: No constructor 
> could take the source type, or constructor overload resolution was ambiguous
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2660: 
> 'testing::internal::EqHelper::Compare': function does not take 3 
> arguments
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430):
>  note: see declaration of 'testing::internal::EqHelper::Compare'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2512: 
> 'testing::AssertionResult': no appropriate default constructor available
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256):
>  note: see declaration of 'testing::AssertionResult'
>  

[jira] [Commented] (MESOS-9718) Compile failures with char8_t by MSVC under /std:c++latest mode

2019-04-10 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814600#comment-16814600
 ] 

Joseph Wu commented on MESOS-9718:
--

Looks like this error occurs because in C++20, a string literal like 
{{u8"..."}} translates into a {{const char8_t[N]}}, whereas in earlier C++ 
versions, the same expression gives a {{const char[N]}} type.  We can 
implicitly convert to {{std::string}} from {{const char[N]}}, but not from 
{{const char8_t[N]}}, which should be held by a {{std::u8string}}.

I'm not sure if there is a quick fix for this, since we're still on C++14 or so.

Here's some further reading:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r3.html
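
A tiny illustration of the change (builds as C++17; under C++20 / {{/std:c++latest}} the commented line is an error and the {{std::u8string}} alternative applies):
{code}
#include <string>

std::string ok = "plain literal";      // const char[N]: converts fine.

// Pre-C++20: u8"..." is const char[N], so the line below compiles.
// C++20:     u8"..." is const char8_t[N], so it is a compile error.
// std::string bad = u8"utf-8 literal";

#if defined(__cpp_char8_t)
// One option under C++20 is to hold the literal in std::u8string instead.
std::u8string fine = u8"utf-8 literal";
#endif
{code}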

> Compile failures with char8_t by MSVC under /std:c++latest mode
> ---
>
> Key: MESOS-9718
> URL: https://issues.apache.org/jira/browse/MESOS-9718
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: QuellaZhang
>Priority: Major
>  Labels: windows
>
> Hi All,
> We've stumbled across some build failures in Mesos after implementing support 
> for char8_t under /std:c++latest in the development version of Visual C++. 
> Could you help look at this? Thanks in advance! Note that this issue is only 
> found when compiling with an unreleased VC toolset; the next release of MSVC 
> will have this behavior.
> *Repro steps:*
>  git clone -c core.autocrlf=true [https://github.com/apache/mesos] 
> D:\mesos\src
>  open a VS 2017 x64 command prompt as admin and browse to D:\mesos
>  set _CL_=/std:c++latest
>  cd src
>  .\bootstrap.bat
>  cd ..
>  mkdir build_x64 && pushd build_x64
>  cmake ..\src -G "Visual Studio 15 2017 Win64" 
> -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64
> *Failures:*
>  base64_tests.i
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2664: 
> 'std::string base64::encode_url_safe(const std::string &,bool)': cannot 
> convert argument 1 from 'const char8_t [12]' to 'const std::string &'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: Reason: cannot 
> convert from 'const char8_t [12]' to 'const std::string'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: No constructor 
> could take the source type, or constructor overload resolution was ambiguous
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2660: 
> 'testing::internal::EqHelper::Compare': function does not take 3 
> arguments
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430):
>  note: see declaration of 'testing::internal::EqHelper::Compare'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2512: 
> 'testing::AssertionResult': no appropriate default constructor available
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256):
>  note: see declaration of 'testing::AssertionResult'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2664: 
> 'std::string base64::encode_url_safe(const std::string &,bool)': cannot 
> convert argument 1 from 'const char8_t [12]' to 'const std::string &'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: Reason: cannot 
> convert from 'const char8_t [12]' to 'const std::string'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: No constructor 
> could take the source type, or constructor overload resolution was ambiguous
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2660: 
> 'testing::internal::EqHelper::Compare': function does not take 3 
> arguments
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430):
>  note: see declaration of 'testing::internal::EqHelper::Compare'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2512: 
> 'testing::AssertionResult': no appropriate default constructor available
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256):
>  note: see declaration of 'testing::AssertionResult'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2664: 
> 'Try base64::decode_url_safe(const std::string &)': cannot 
> convert argument 1 from 'const char8_t [16]' to 'const std::string &'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): note: Reason: cannot 
> convert from 'const char8_t [16]' to 'const std::string'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): note: No constructor 
> could take the source type, or constructor overload resolution was ambiguous
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2672: 
> 'AssertSomeEq': no matching overloaded 

[jira] [Commented] (MESOS-6285) Agents may OOM during recovery if there are too many tasks or executors

2019-04-08 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812549#comment-16812549
 ] 

Joseph Wu commented on MESOS-6285:
--

MESOS-7947 is only a partial solution.  That ticket added completed task 
metadata directories to the agent's existing GC mechanism.  This means it is 
still possible to hit an OOM during recovery if:
1) We launch lots of tasks very quickly.  The GC settings won't clean up quick 
bursts of tasks until days or weeks later.
2) Or, we launch many tasks, with low disk utilization.  Since disk is usually 
much larger than memory, it is possible to have too much metadata to fit into 
memory, while not consuming that much space on disk.  Again, GC won't kick in 
for days/weeks.

> Agents may OOM during recovery if there are too many tasks or executors
> ---
>
> Key: MESOS-6285
> URL: https://issues.apache.org/jira/browse/MESOS-6285
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
>Reporter: Joseph Wu
>Priority: Critical
>  Labels: mesosphere
>
> On an test cluster, we encountered a degenerate case where running the 
> example {{long-lived-framework}} for over a week would render the agent 
> un-recoverable.  
> The {{long-lived-framework}} creates one custom {{long-lived-executor}} and 
> launches a single task on that executor every time it receives an offer from 
> that agent.  Over a week's worth of time, the framework manages to launch 
> some 400k tasks (short sleeps) on one executor.  During runtime, this is not 
> problematic, as each completed task is quickly rotated out of the agent's 
> memory (and checkpointed to disk).
> During recovery, however, the agent reads every single task into memory, 
> which leads to slow recovery; and often results in the agent being OOM-killed 
> before it finishes recovering.
> To repro this condition quickly:
> 1) Apply this patch to the {{long-lived-framework}}:
> {code}
> diff --git a/src/examples/long_lived_framework.cpp 
> b/src/examples/long_lived_framework.cpp
> index 7c57eb5..1263d82 100644
> --- a/src/examples/long_lived_framework.cpp
> +++ b/src/examples/long_lived_framework.cpp
> @@ -358,16 +358,6 @@ private:
>// Helper to launch a task using an offer.
>void launch(const Offer& offer)
>{
> -int taskId = tasksLaunched++;
> -++metrics.tasks_launched;
> -
> -TaskInfo task;
> -task.set_name("Task " + stringify(taskId));
> -task.mutable_task_id()->set_value(stringify(taskId));
> -task.mutable_agent_id()->MergeFrom(offer.agent_id());
> -task.mutable_resources()->CopyFrom(taskResources);
> -task.mutable_executor()->CopyFrom(executor);
> -
>  Call call;
>  call.set_type(Call::ACCEPT);
>  
> @@ -380,7 +370,23 @@ private:
>  Offer::Operation* operation = accept->add_operations();
>  operation->set_type(Offer::Operation::LAUNCH);
>  
> -operation->mutable_launch()->add_task_infos()->CopyFrom(task);
> +// Launch as many tasks as possible in the given offer.
> +Resources remaining = Resources(offer.resources()).flatten();
> +while (remaining.contains(taskResources)) {
> +  int taskId = tasksLaunched++;
> +  ++metrics.tasks_launched;
> +
> +  TaskInfo task;
> +  task.set_name("Task " + stringify(taskId));
> +  task.mutable_task_id()->set_value(stringify(taskId));
> +  task.mutable_agent_id()->MergeFrom(offer.agent_id());
> +  task.mutable_resources()->CopyFrom(taskResources);
> +  task.mutable_executor()->CopyFrom(executor);
> +
> +  operation->mutable_launch()->add_task_infos()->CopyFrom(task);
> +
> +  remaining -= taskResources;
> +}
>  
>  mesos->send(call);
>}
> {code}
> 2) Run a master, agent, and {{long-lived-framework}}.  On a 1 CPU, 1 GB agent 
> + this patch, it should take about 10 minutes to build up sufficient task 
> launches.
> 3) Restart the agent and watch it flail during recovery.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9352) Data in persistent volume deleted accidentally when using Docker container and Persistent volume

2019-04-01 Thread Joseph Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-9352:


Assignee: Joseph Wu

> Data in persistent volume deleted accidentally when using Docker container 
> and Persistent volume
> 
>
> Key: MESOS-9352
> URL: https://issues.apache.org/jira/browse/MESOS-9352
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, docker
>Affects Versions: 1.5.1, 1.5.2
> Environment: DCOS 1.11.6
> Mesos 1.5.2
>Reporter: David Ko
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: dcos, dcos-1.11.6, mesosphere, persistent-volumes
> Attachments: image-2018-10-24-22-20-51-059.png, 
> image-2018-10-24-22-21-13-399.png
>
>
> Using a Docker image with a persistent volume to start a service can cause 
> data in the persistent volume to be deleted accidentally when the task is 
> killed and restarted; old mount points are also not unmounted, even after the 
> service has been deleted. 
> *The expected result is that data in the persistent volume is kept until the 
> task is deleted completely, and dangling mount points are unmounted correctly.*
>  
> *Step 1:* Use the JSON config below to create a MySQL server using a Docker 
> image and a persistent volume
> {code:javascript}
> {
>   "env": {
> "MYSQL_USER": "wordpress",
> "MYSQL_PASSWORD": "secret",
> "MYSQL_ROOT_PASSWORD": "supersecret",
> "MYSQL_DATABASE": "wordpress"
>   },
>   "id": "/mysqlgc",
>   "backoffFactor": 1.15,
>   "backoffSeconds": 1,
>   "constraints": [
> [
>   "hostname",
>   "IS",
>   "172.27.12.216"
> ]
>   ],
>   "container": {
> "portMappings": [
>   {
> "containerPort": 3306,
> "hostPort": 0,
> "protocol": "tcp",
> "servicePort": 1
>   }
> ],
> "type": "DOCKER",
> "volumes": [
>   {
> "persistent": {
>   "type": "root",
>   "size": 1000,
>   "constraints": []
> },
> "mode": "RW",
> "containerPath": "mysqldata"
>   },
>   {
> "containerPath": "/var/lib/mysql",
> "hostPath": "mysqldata",
> "mode": "RW"
>   }
> ],
> "docker": {
>   "image": "mysql",
>   "forcePullImage": false,
>   "privileged": false,
>   "parameters": []
> }
>   },
>   "cpus": 1,
>   "disk": 0,
>   "instances": 1,
>   "maxLaunchDelaySeconds": 3600,
>   "mem": 512,
>   "gpus": 0,
>   "networks": [
> {
>   "mode": "container/bridge"
> }
>   ],
>   "residency": {
> "relaunchEscalationTimeoutSeconds": 3600,
> "taskLostBehavior": "WAIT_FOREVER"
>   },
>   "requirePorts": false,
>   "upgradeStrategy": {
> "maximumOverCapacity": 0,
> "minimumHealthCapacity": 0
>   },
>   "killSelection": "YOUNGEST_FIRST",
>   "unreachableStrategy": "disabled",
>   "healthChecks": [],
>   "fetch": []
> }
> {code}
> *Step 2:* Kill the mysqld process to force rescheduling of a new MySQL task. 
> Afterwards there are 2 mount points to the same persistent volume, which means 
> the old mount point was not unmounted immediately.
> !image-2018-10-24-22-20-51-059.png!
> *Step 3:* After GC, data in the persistent volume was deleted accidentally, 
> but mysqld (the Mesos task) was still running.
> !image-2018-10-24-22-21-13-399.png!
> *Step 4:* Delete the MySQL service from Marathon; none of the mount points can 
> be unmounted, even though the service has already been deleted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9646) Look into enabling the libarchive extraction flag ARCHIVE_EXTRACT_SECURE_NOABSOLUTEPATHS by default

2019-03-11 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-9646:


 Summary: Look into enabling the libarchive extraction flag 
ARCHIVE_EXTRACT_SECURE_NOABSOLUTEPATHS by default
 Key: MESOS-9646
 URL: https://issues.apache.org/jira/browse/MESOS-9646
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 1.7.0, 1.8.0
Reporter: Joseph Wu


The libarchive source provides the following flag (one of several 
ARCHIVE_EXTRACT_SECURE_* extraction options): 
{code}
/* Default: Do not try to guard against extracts redirected by symlinks. */
/* Note: With ARCHIVE_EXTRACT_UNLINK, will remove any intermediate symlink. */
#define ARCHIVE_EXTRACT_SECURE_SYMLINKS (0x0100)
{code}
https://github.com/libarchive/libarchive/blob/master/libarchive/archive.h#L672-L674

We should check whether the default behavior is insecure (i.e. allowing a fetched 
artifact to affect files outside the sandbox).
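
For reference, these options are passed to libarchive's disk writer. A minimal sketch (not the Mesos code) of turning the secure extraction flags on:
{code}
#include <archive.h>

// Create a disk writer that refuses the common path-escape tricks.
// The ARCHIVE_EXTRACT_SECURE_* flags are standard libarchive options.
struct archive* makeSecureWriter()
{
  const int flags =
    ARCHIVE_EXTRACT_TIME |
    ARCHIVE_EXTRACT_SECURE_SYMLINKS |        // Refuse extracts redirected by symlinks.
    ARCHIVE_EXTRACT_SECURE_NODOTDOT |        // Refuse '..' path elements.
    ARCHIVE_EXTRACT_SECURE_NOABSOLUTEPATHS;  // Refuse absolute paths.

  struct archive* writer = archive_write_disk_new();
  archive_write_disk_set_options(writer, flags);
  return writer;
}
{code}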



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9635) OperationReconciliationTest.AgentPendingOperationAfterMasterFailover is flaky again (3x) due to orphan operations

2019-03-06 Thread Joseph Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-9635:


Assignee: Joseph Wu

> OperationReconciliationTest.AgentPendingOperationAfterMasterFailover is flaky 
> again (3x) due to orphan operations
> -
>
> Key: MESOS-9635
> URL: https://issues.apache.org/jira/browse/MESOS-9635
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Assignee: Joseph Wu
>Priority: Major
>
> This test fails consistently when run while the system is stressed:
> {code}
> [ RUN  ] 
> ContentType/OperationReconciliationTest.AgentPendingOperationAfterMasterFailover/0
> F0305 08:10:07.670622  3982 hierarchical.cpp:1259] Check failed: 
> slave.getAllocated().contains(resources) {} does not contain disk(allocated: 
> default-role)[RAW(,,profile)]:200
> *** Check failure stack trace: ***
> @ 0x7f1120b0ce5e  google::LogMessage::Fail()
> @ 0x7f1120b0cdbb  google::LogMessage::SendToLog()
> @ 0x7f1120b0c7b5  google::LogMessage::Flush()
> @ 0x7f1120b0f578  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f111e536f2a  
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::recoverResources()
> @ 0x5580c2651c26  
> _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDERKNS1_7SlaveIDERKNS1_9ResourcesERK6OptionINS1_7FiltersEES8_SB_SE_SJ_EEvRKNS_3PIDIT_EEMSL_FvT0_T1_T2_T3_EOT4_OT5_OT6_OT7_ENKUlOS6_OS9_OSC_OSH_PNS_11ProcessBaseEE_clES13_S14_S15_S16_S18_
> @ 0x5580c26c7e02  
> _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS3_11FrameworkIDERKNS3_7SlaveIDERKNS3_9ResourcesERK6OptionINS3_7FiltersEESA_SD_SG_SL_EEvRKNS1_3PIDIT_EEMSN_FvT0_T1_T2_T3_EOT4_OT5_OT6_OT7_EUlOS8_OSB_OSE_OSJ_PNS1_11ProcessBaseEE_JS8_SB_SE_SJ_S1A_EEEDTclcl7forwardISN_Efp_Espcl7forwardIT0_Efp0_EEEOSN_DpOS1C_
> @ 0x5580c26c5b1e  
> _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_11FrameworkIDERKNS4_7SlaveIDERKNS4_9ResourcesERK6OptionINS4_7FiltersEESB_SE_SH_SM_EEvRKNS2_3PIDIT_EEMSO_FvT0_T1_T2_T3_EOT4_OT5_OT6_OT7_EUlOS9_OSC_OSF_OSK_PNS2_11ProcessBaseEE_JS9_SC_SF_SK_St12_PlaceholderILi113invoke_expandIS1C_St5tupleIJS9_SC_SF_SK_S1E_EES1H_IJOS1B_EEJLm0ELm1ELm2ELm3ELm4DTcl6invokecl7forwardISO_Efp_Espcl6expandcl3getIXT2_EEcl7forwardISS_Efp0_EEcl7forwardIST_Efp2_OSO_OSS_N5cpp1416integer_sequenceImJXspT2_OST_
> @ 0x5580c26c47ac  
> _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_11FrameworkIDERKNS4_7SlaveIDERKNS4_9ResourcesERK6OptionINS4_7FiltersEESB_SE_SH_SM_EEvRKNS2_3PIDIT_EEMSO_FvT0_T1_T2_T3_EOT4_OT5_OT6_OT7_EUlOS9_OSC_OSF_OSK_PNS2_11ProcessBaseEE_JS9_SC_SF_SK_St12_PlaceholderILi1clIJS1B_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1ELm2ELm3ELm4_Ecl16forward_as_tuplespcl7forwardIT_Efp_DpOS1K_
> @ 0x5580c26c3ad7  
> _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS6_11FrameworkIDERKNS6_7SlaveIDERKNS6_9ResourcesERK6OptionINS6_7FiltersEESD_SG_SJ_SO_EEvRKNS4_3PIDIT_EEMSQ_FvT0_T1_T2_T3_EOT4_OT5_OT6_OT7_EUlOSB_OSE_OSH_OSM_PNS4_11ProcessBaseEE_JSB_SE_SH_SM_St12_PlaceholderILi1EJS1D_EEEDTclcl7forwardISQ_Efp_Espcl7forwardIT0_Efp0_EEEOSQ_DpOS1I_
> @ 0x5580c26c32ad  
> _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS7_11FrameworkIDERKNS7_7SlaveIDERKNS7_9ResourcesERK6OptionINS7_7FiltersEESE_SH_SK_SP_EEvRKNS5_3PIDIT_EEMSR_FvT0_T1_T2_T3_EOT4_OT5_OT6_OT7_EUlOSC_OSF_OSI_OSN_PNS5_11ProcessBaseEE_JSC_SF_SI_SN_St12_PlaceholderILi1EJS1E_EEEvOSR_DpOT0_
> @ 0x5580c26c0a5e  
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNSA_11FrameworkIDERKNSA_7SlaveIDERKNSA_9ResourcesERK6OptionINSA_7FiltersEESH_SK_SN_SS_EEvRKNS1_3PIDIT_EEMSU_FvT0_T1_T2_T3_EOT4_OT5_OT6_OT7_EUlOSF_OSI_OSL_OSQ_S3_E_JSF_SI_SL_SQ_St12_PlaceholderILi1EEclEOS3_
> @ 0x7f1120a51c60  
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
> @ 0x7f1120a16a4e  process::ProcessBase::consume()
> @ 0x7f1120a3d9d8  
> _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
> @ 0x5580c2284afa  process::ProcessBase::serve()
> @ 0x7f1120a138db  process::ProcessManager::resume()
> @ 0x7f1120a0fc28  
> _ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv
> 

[jira] [Created] (MESOS-9635) OperationReconciliationTest.AgentPendingOperationAfterMasterFailover is flaky again (3x) due to orphan operations

2019-03-06 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-9635:


 Summary: 
OperationReconciliationTest.AgentPendingOperationAfterMasterFailover is flaky 
again (3x) due to orphan operations
 Key: MESOS-9635
 URL: https://issues.apache.org/jira/browse/MESOS-9635
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


This test can be seen failing quite frequently with the following error:
{code}
Error Message

../../src/tests/operation_reconciliation_tests.cpp:864
  Expected: OPERATION_PENDING
To be equal to: operationStatus.state()
  Which is: OPERATION_UNKNOWN
{code}

which seems to be a different issue from the one described in MESOS-8872.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9610) Fetcher vulnerability - escaping from sandbox

2019-02-26 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778613#comment-16778613
 ] 

Joseph Wu commented on MESOS-9610:
--

This is related to the introduction of libarchive in 1.7.0.

The code which creates files/directories does not sanitize paths containing 
extraneous ".." components:
https://github.com/apache/mesos/blob/4a2dbe25c7377636fe3a9d9c8576297a6db561cd/3rdparty/stout/include/stout/archiver.hpp#L128-L130
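
For illustration, a self-contained sketch of the kind of check that would 
reject entries escaping the sandbox. The helper below is hypothetical (not 
stout code, and not necessarily the fix that will land); it assumes C++17 
<filesystem> and a sandbox path without a trailing separator:
{code}
#include <algorithm>
#include <filesystem>

namespace fs = std::filesystem;

// Hypothetical helper: returns true only if the archive entry path, once
// joined with the sandbox and lexically normalized, still lies under the
// sandbox directory.
bool staysInSandbox(const fs::path& sandbox, const fs::path& entry)
{
  const fs::path root = sandbox.lexically_normal();
  const fs::path target = (sandbox / entry).lexically_normal();

  // 'target' must start with all components of 'root'.
  auto result = std::mismatch(
      root.begin(), root.end(), target.begin(), target.end());

  return result.first == root.end();
}
{code}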

> Fetcher vulnerability - escaping from sandbox
> -
>
> Key: MESOS-9610
> URL: https://issues.apache.org/jira/browse/MESOS-9610
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.7.2
>Reporter: Mariusz Derela
>Priority: Blocker
>  Labels: bug, security-issue, vulnerabilities
>
> I have noticed that it is possible to exploit the fetcher and overwrite any 
> file on the agent host.
> Scenario to reproduce:
> 1) Prepare a file with arbitrary content, give it a name like 
> "../../../etc/test", and archive it. We can use Python's zipfile module to 
> achieve that:
> {code:java}
> >>> import zipfile
> >>> zip = zipfile.ZipFile("exploit.zip", "w")
> >>> zip.writestr("../../../../../../../../../../../../etc/mariusz_was_here.txt", "some content")
> >>> zip.close()
> {code}
> 2) Prepare a service that will use our artifact (exploit.zip).
> 3) Run the service.
> At the end, our file shows up in /etc. As you can imagine, there are many 
> ways this could be abused.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9564) Logrotate container logger lets tasks execute arbitrary commands in the Mesos agent's namespace

2019-02-26 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778488#comment-16778488
 ] 

Joseph Wu commented on MESOS-9564:
--

I'll be backporting to 1.5.x and beyond, but the backports will not block any 
of the ongoing releases since the module is optional.

> Logrotate container logger lets tasks execute arbitrary commands in the Mesos 
> agent's namespace
> ---
>
> Key: MESOS-9564
> URL: https://issues.apache.org/jira/browse/MESOS-9564
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, modules
>Reporter: Joseph Wu
>Assignee: Andrei Budnik
>Priority: Critical
>  Labels: foundations, mesosphere
> Fix For: 1.8.0
>
>
> The non-default {{LogrotateContainerLogger}} module allows tasks to configure 
> sandbox log rotation (See 
> http://mesos.apache.org/documentation/latest/logging/#Containers ).  The 
> {{logrotate_stdout_options}} and {{logrotate_stderr_options}} in particular 
> let the task specify free-form text, which is written to a configuration file 
> located in the task's sandbox.  The module does not sanitize or check this 
> configuration at all.
> The logger itself will eventually run {{logrotate}} against the written 
> configuration file, but the logger is not isolated in the same way as the 
> task.  For both the Mesos and Docker containerizers, the logger binary will 
> run in the same namespace as the Mesos agent.  This makes it possible to 
> affect files outside of the task's mount namespace.
> Two modes of attack are known to be problematic:
> * Changing or adding entries to the configuration file.  Normally, the 
> configuration file contains a single file to rotate:
> {code}
> /path/to/sandbox/stdout {
>   
> }
> {code}
> It is trivial to add text to the {{logrotate_stdout_options}} to add a new 
> entry:
> {code}
> /path/to/sandbox/stdout {
>   
> }
> /path/to/other/file/on/disk {
>   
> }
> {code}
> * Logrotate's {{postrotate}} option allows for execution of arbitrary 
> commands.  This can again be supplied with the {{logrotate_stdout_options}} 
> variable.
> {code}
> /path/to/sandbox/stdout {
>   postrotate
> rm -rf /
>   endscript
> }
> {code}
> Some potential fixes to consider:
> * Overwrite the .logrotate.conf files each time. This would give only 
> milliseconds between writing and calling logrotate for a third party to modify 
> the config files maliciously. This would not help if the task itself had 
> postrotate options in its environment variables.
> * Sanitize the free-form options field in the environment variables to remove 
> postrotate or injection attempts like }\n/path/to/some/file\noptions{.
> * Refactor parts of the Mesos isolation code path so that the logger and IO 
> switchboard binary live in the same namespaces as the container (instead of 
> the agent). This would also be nice in that the logger's CPU usage would then 
> be accounted for within the container's resources.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9542) Hierarchical allocator check failure when an operation on a shutdown framework finishes

2019-02-12 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1676#comment-1676
 ] 

Joseph Wu commented on MESOS-9542:
--

Still in progress, but some reviews are up starting at: 
https://reviews.apache.org/r/69960/

> Hierarchical allocator check failure when an operation on a shutdown 
> framework finishes
> ---
>
> Key: MESOS-9542
> URL: https://issues.apache.org/jira/browse/MESOS-9542
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.7.0, 1.7.1, 1.8.0
>Reporter: Benjamin Bannier
>Assignee: Joseph Wu
>Priority: Blocker
>  Labels: foundations, mesosphere, mesosphere-dss-ga, 
> operation-feedback
>
> When a non-speculated operation like e.g., {{CREATE_DISK}} becomes terminal 
> after the originating framework was torn down, we run into an assertion 
> failure in the allocator.
> {noformat}
> I0129 11:55:35.764394 57857 master.cpp:11373] Updating the state of operation 
> 'operation' (uuid: 10a782bd-9e60-42da-90d6-c00997a25645) for framework 
> a4d0499b-c0d3-4abf-8458-73e595d061ce- (latest state: OPERATION_PENDING, 
> status update state: OPERATION_FINISHED)
> F0129 11:55:35.764744 57925 hierarchical.cpp:834] Check failed: 
> frameworks.contains(frameworkId){noformat}
> With non-speculated operations like e.g., {{CREATE_DISK}} it became possible 
> that operations outlive their originating framework. This was not possible 
> with speculated operations like {{RESERVE}} which were always applied 
> immediately by the master.
> The master does not take this into account, but instead unconditionally calls 
> {{Allocator::updateAllocation}} which asserts that the framework is still 
> known to the allocator.
> Reproducer:
>  * register a framework with the master.
>  * add an agent with a resource provider.
>  * let the framework trigger a non-speculated operation like {{CREATE_DISK.}}
>  * tear down the framework before a terminal operation status update reaches 
> the master; this causes the master to e.g., remove the framework from the 
> allocator.
>  * let a terminal, successful operation status update reach the master
> To solve this we should clean up the lifetimes of operations. Since operations 
> can outlive their framework (unlike e.g., tasks), we probably need a 
> different approach here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9564) Logrotate container logger lets tasks execute arbitrary commands in the Mesos agent's namespace

2019-02-11 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-9564:


 Summary: Logrotate container logger lets tasks execute arbitrary 
commands in the Mesos agent's namespace
 Key: MESOS-9564
 URL: https://issues.apache.org/jira/browse/MESOS-9564
 Project: Mesos
  Issue Type: Bug
  Components: agent, modules
Reporter: Joseph Wu


The non-default {{LogrotateContainerLogger}} module allows tasks to configure 
sandbox log rotation (See 
http://mesos.apache.org/documentation/latest/logging/#Containers ).  The 
{{logrotate_stdout_options}} and {{logrotate_stderr_options}} in particular let 
the task specify free-form text, which is written to a configuration file 
located in the task's sandbox.  The module does not sanitize or check this 
configuration at all.

The logger itself will eventually run {{logrotate}} against the written 
configuration file, but the logger is not isolated in the same way as the task. 
 For both the Mesos and Docker containerizers, the logger binary will run in 
the same namespace as the Mesos agent.  This makes it possible to affect files 
outside of the task's mount namespace.

Two modes of attack are known to be problematic:
* Changing or adding entries to the configuration file.  Normally, the 
configuration file contains a single file to rotate:
{code}
/path/to/sandbox/stdout {
  
}
{code}
It is trivial to add text to the {{logrotate_stdout_options}} to add a new 
entry:
{code}
/path/to/sandbox/stdout {
  
}
/path/to/other/file/on/disk {
  
}
{code}
* Logrotate's {{postrotate}} option allows for execution of arbitrary commands. 
 This can again be supplied with the {{logrotate_stdout_options}} variable.
{code}
/path/to/sandbox/stdout {
  postrotate
rm -rf /
  endscript
}
{code}

Some potential fixes to consider:
* Overwrite the .logrotate.conf files each time. This would give only 
milliseconds between writing and calling logrotate for a third party to modify 
the config files maliciously. This would not help if the task itself had 
postrotate options in its environment variables.
* Sanitize the free-form options field in the environment variables to remove 
postrotate or injection attempts like }\n/path/to/some/file\noptions{ (a rough 
sketch of such a check follows after this list).
* Refactor parts of the Mesos isolation code path so that the logger and IO 
switchboard binary live in the same namespaces as the container (instead of the 
agent). This would also be nice in that the logger's CPU usage would then be 
accounted for within the container's resources.
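
To make the sanitization option concrete, here is a self-contained sketch of 
what such a validator could look like. The helper is hypothetical, not the fix 
that actually shipped:
{code}
#include <string>
#include <vector>

// Hypothetical sanitizer for the free-form logrotate_stdout_options /
// logrotate_stderr_options values: reject anything that could close the
// generated config block or execute commands.
bool isSafeLogrotateOptions(const std::string& options)
{
  // Braces would let a task terminate the generated entry and append a new
  // one for an arbitrary file; the script directives allow arbitrary
  // command execution.
  const std::vector<std::string> forbidden = {
    "{", "}", "postrotate", "prerotate", "firstaction", "lastaction"
  };

  for (const std::string& token : forbidden) {
    if (options.find(token) != std::string::npos) {
      return false;
    }
  }

  return true;
}
{code}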



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9557) Operations are leaked in Framework struct when agents are removed

2019-02-11 Thread Joseph Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-9557:


Assignee: Joseph Wu

> Operations are leaked in Framework struct when agents are removed
> -
>
> Key: MESOS-9557
> URL: https://issues.apache.org/jira/browse/MESOS-9557
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Greg Mann
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations, mesosphere
>
> Currently, when agents are removed from the master, their operations are not 
> removed from the {{Framework}} structs. We should ensure that this occurs in 
> all cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9507) Agent could not recover due to empty docker volume checkpointed files.

2019-01-23 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16750531#comment-16750531
 ] 

Joseph Wu commented on MESOS-9507:
--

One possible fix is to add a conditional between these two blocks:
https://github.com/apache/mesos/blob/0f8ee9555f89f0a5f139bc12c666a60164c7b09b/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L277-L287

{code}
  if (read.isNone()) {
// This could happen if the agent died after opening the file for writing
// but before it checkpointed anything.
LOG(WARNING) << "Some descriptive warning";

// 
  }
{code}
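
Filling in the placeholder above, the body of that conditional might simply 
skip the volume. This is only a sketch: it assumes the surrounding code 
iterates over checkpointed containers, and 'containerId' is a hypothetical 
variable name:
{code}
  if (read.isNone()) {
    // The agent died after creating the checkpoint file but before writing
    // to it, so the docker volume was never mounted. Skip recovering this
    // volume instead of failing the whole agent recovery.
    LOG(WARNING) << "Skipping docker volume recovery for container "
                 << containerId << ": empty checkpoint file";

    continue;
  }
{code}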

> Agent could not recover due to empty docker volume checkpointed files.
> --
>
> Key: MESOS-9507
> URL: https://issues.apache.org/jira/browse/MESOS-9507
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Priority: Critical
>  Labels: containerizer
>
> Agent could not recover due to empty docker volume checkpointed files. Please 
> see logs:
> {noformat}
> Nov 12 17:12:00 guppy mesos-agent[38960]: E1112 17:12:00.978682 38969 
> slave.cpp:6279] EXIT with status 1: Failed to perform recovery: Collect 
> failed: Collect failed: Failed to recover docker volumes for orphan container 
> e1b04051-1e4a-47a9-b866-1d625cda1d22: JSON parse failed: syntax error at line 
> 1 near:
> Nov 12 17:12:00 guppy mesos-agent[38960]: To remedy this do as follows: 
> Nov 12 17:12:00 guppy mesos-agent[38960]: Step 1: rm -f 
> /var/lib/mesos/slave/meta/slaves/latest
> Nov 12 17:12:00 guppy mesos-agent[38960]: This ensures agent doesn't recover 
> old live executors.
> Nov 12 17:12:00 guppy mesos-agent[38960]: Step 2: Restart the agent. 
> Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service: main process 
> exited, code=exited, status=1/FAILURE
> Nov 12 17:12:00 guppy systemd[1]: Unit dcos-mesos-slave.service entered 
> failed state.
> Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service failed.
> {noformat}
> This is caused by agent recovery after the volume state file is created but 
> before checkpointing finishes. Basically the docker volume is not mounted 
> yet, so the docker volume isolator should skip recovering this volume.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9527) Agent does not check if master pings come from expected leader

2019-01-16 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-9527:


 Summary: Agent does not check if master pings come from expected 
leader
 Key: MESOS-9527
 URL: https://issues.apache.org/jira/browse/MESOS-9527
 Project: Mesos
  Issue Type: Bug
  Components: agent, master
Affects Versions: 1.2.0
Reporter: Joseph Wu


The agent code that receives pings from the master does not check if the ping 
comes from an expected source:
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L5944-L5946

This can be problematic if, for some reason, the agent is moved from one 
cluster to another.
For example:
# First, I started two masters, on localhost:
{code}
src/mesos-master --work_dir=/tmp/master1
src/mesos-master --work_dir=/tmp/master2 --port=5052
{code}
# Next, I started an agent and pointed it at the first master
{code}
src/mesos-agent --work_dir=/tmp/agent --master=127.0.0.1:5050
{code}
# I promptly killed the agent after it had registered and pointed it at the 
second master
{code}
src/mesos-agent --work_dir=/tmp/agent --master=127.0.0.1:5052
{code}

The agent was now disconnected from Master1 and connected to Master2.  
However, Master1 continued to ping the agent (believing the agent was merely 
disconnected).  This caused the agent to re-register with Master2 every 15 
seconds.
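
For illustration, a self-contained sketch of the missing guard. The helper is 
hypothetical; in the real code the check would live in the agent's ping 
handler and compare libprocess UPIDs rather than strings:
{code}
#include <iostream>
#include <string>

// Hypothetical stand-in for the check described above: accept a ping only
// if it comes from the master the agent currently believes is the leader.
bool shouldAcceptPing(const std::string& from, const std::string& expectedMaster)
{
  if (!expectedMaster.empty() && from != expectedMaster) {
    std::cerr << "Ignoring ping from " << from
              << "; expected leading master is " << expectedMaster << std::endl;
    return false;
  }

  return true;
}
{code}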



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9526) Remove "local" cluster functionality from the native scheduler libraries.

2019-01-16 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-9526:


 Summary: Remove "local" cluster functionality from the native 
scheduler libraries.
 Key: MESOS-9526
 URL: https://issues.apache.org/jira/browse/MESOS-9526
 Project: Mesos
  Issue Type: Task
  Components: cmake
Reporter: Joseph Wu


Schedulers that link to libmesos currently have the option of specifying a 
master like {{--master=local}}, which causes the scheduler library to spin up a 
local Mesos cluster (masters & agents).  This function is used by the various 
example frameworks, and could potentially be used for other tests.

The downside of this feature is that the scheduler library is required to pull 
in the entire source of the master and agent.

The example framework tests could be changed to launch masters/agents in the 
test body, instead of local clusters.  This would have the added benefit of 
improving the reliability of those tests (because they would be more easily 
synchronized and have more conditions we can verify).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-3542) Separate libmesos into compiling from many binaries.

2019-01-16 Thread Joseph Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-3542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-3542:


Shepherd: Benjamin Bannier  (was: Joris Van Remoortere)
Assignee: Joseph Wu
  Labels: cmake foundations mesosphere  (was: cmake mesosphere)

> Separate libmesos into compiling from many binaries.
> 
>
> Key: MESOS-3542
> URL: https://issues.apache.org/jira/browse/MESOS-3542
> Project: Mesos
>  Issue Type: Epic
>  Components: cmake
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>Priority: Major
>  Labels: cmake, foundations, mesosphere
>
> Historically, libmesos has been built as a huge monolithic binary. Another 
> idea would be to build it from a bunch of smaller libraries (_e.g._, 
> libagent, _etc_.).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8262) CMake build with java enabled fails during linking step.

2019-01-16 Thread Joseph Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-8262:


Assignee: Joseph Wu

> CMake build with java enabled fails during linking step.
> 
>
> Key: MESOS-8262
> URL: https://issues.apache.org/jira/browse/MESOS-8262
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.5.0
> Environment: Mac OS 10.11.6
>Reporter: Alexander Rukletsov
>Assignee: Joseph Wu
>Priority: Major
>  Labels: build, cmake
>
> I've enabled JAVA in cmake build and have run the complete build via {{ninja 
> check}}. Build failed with the following output:
> {noformat}
> [312/689] Linking CXX shared library src/libmesos-java.dylib
> FAILED: src/libmesos-java.dylib 
> : && /Library/Developer/CommandLineTools/usr/bin/c++ -std=c++11 
> -Wformat-security -fstack-protector-strong  -dynamiclib 
> -Wl,-headerpad_max_install_names  -o src/libmesos-java.dylib -install_name 
> @rpath/libmesos-java.dylib 
> src/CMakeFiles/mesos-java.dir/java/jni/convert.cpp.o 
> src/CMakeFiles/mesos-java.dir/java/jni/construct.cpp.o 
> src/CMakeFiles/mesos-java.dir/java/jni/org_apache_mesos_Log.cpp.o 
> src/CMakeFiles/mesos-java.dir/java/jni/org_apache_mesos_MesosExecutorDriver.cpp.o
>  
> src/CMakeFiles/mesos-java.dir/java/jni/org_apache_mesos_MesosNativeLibrary.cpp.o
>  
> src/CMakeFiles/mesos-java.dir/java/jni/org_apache_mesos_MesosSchedulerDriver.cpp.o
>  
> src/CMakeFiles/mesos-java.dir/java/jni/org_apache_mesos_state_AbstractState.cpp.o
>  
> src/CMakeFiles/mesos-java.dir/java/jni/org_apache_mesos_state_LevelDBState.cpp.o
>  src/CMakeFiles/mesos-java.dir/java/jni/org_apache_mesos_state_LogState.cpp.o 
> src/CMakeFiles/mesos-java.dir/java/jni/org_apache_mesos_state_Variable.cpp.o 
> src/CMakeFiles/mesos-java.dir/java/jni/org_apache_mesos_state_ZooKeeperState.cpp.o
>  
> src/CMakeFiles/mesos-java.dir/java/jni/org_apache_mesos_v1_scheduler_V1Mesos.cpp.o
>  
> src/CMakeFiles/mesos-java.dir/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp.o
>  src/CMakeFiles/mesos-java.dir/jvm/jvm.cpp.o 
> src/CMakeFiles/mesos-java.dir/jvm/org/apache/log4j.cpp.o 
> src/CMakeFiles/mesos-java.dir/jvm/org/apache/zookeeper.cpp.o  
> -Wl,-rpath,/Users/alex/Projects/mesos.build/src 
> -Wl,-rpath,/Users/alex/Projects/mesos.build/3rdparty/libprocess/src 
> src/libmesos-protobufs.dylib 3rdparty/libprocess/src/libprocess.dylib 
> 3rdparty/zookeeper-3.4.8/src/zookeeper-3.4.8-build/libzookeeper.a -framework 
> JavaVM -framework JavaVM 
> 3rdparty/protobuf-3.5.0/src/protobuf-3.5.0-build/libprotobuf.dylib 
> /usr/local/opt/apr/libexec/lib/libapr-1.dylib /usr/lib/libcurl.dylib 
> 3rdparty/glog-0.3.3/src/glog-0.3.3-build/lib/libglog.dylib 
> /usr/lib/libz.dylib /usr/local/opt/subversion/lib/libsvn_delta-1.dylib 
> /usr/local/opt/subversion/lib/libsvn_diff-1.dylib 
> /usr/local/opt/subversion/lib/libsvn_subr-1.dylib 
> 3rdparty/http_parser-2.6.2/src/http_parser-2.6.2-build/libhttp_parser.a 
> 3rdparty/zookeeper-3.4.8/src/zookeeper-3.4.8-build/libhashtable.a && :
> Undefined symbols for architecture x86_64:
>   "mesos::MesosExecutorDriver::MesosExecutorDriver(mesos::Executor*)", 
> referenced from:
>   _Java_org_apache_mesos_MesosExecutorDriver_initialize in 
> org_apache_mesos_MesosExecutorDriver.cpp.o
> <...>
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7622) Agent can crash if a HTTP executor tries to retry subscription in running state.

2019-01-07 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736216#comment-16736216
 ] 

Joseph Wu commented on MESOS-7622:
--

No, the executor changes I was making (MESOS-7564) touch the executor 
subscription code, but shouldn't affect how/when the executor decides to 
register.

Without diving too deeply, these two logs stand out:
{code}
I0605 14:58:25.247808 10718 slave.cpp:3825] Got registration for executor 
'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework 
b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 from executor(1)@127.0.1.1:42459
{code}
{code}
I0605 14:58:25.352342 10712 slave.cpp:3609] Received Subscribe request for HTTP 
executor 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework 
b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 at executor(1)@127.0.1.1:42459
{code}
The same executor registers twice, once as a PID executor, and once as an HTTP 
executor.  The timestamps are close enough to suggest both registrations are 
happening at the same time.

> Agent can crash if a HTTP executor tries to retry subscription in running 
> state.
> 
>
> Key: MESOS-7622
> URL: https://issues.apache.org/jira/browse/MESOS-7622
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, executor
>Affects Versions: 1.2.2
>Reporter: Aaron Wood
>Priority: Critical
>  Labels: foundations
>
> It is possible that a running executor might retry its subscribe request. 
> This can lead to a crash if it previously had any launched tasks. Note that 
> the executor would still be able to subscribe again when the agent process 
> restarts and is recovering.
> {code}
> sudo ./mesos-agent --master=10.0.2.15:5050 --work_dir=/tmp/slave 
> --isolation=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime
>  --image_providers=docker --image_provisioner_backend=overlay 
> --containerizers=mesos --launcher_dir=$(pwd) 
> --executor_environment_variables='{"LD_LIBRARY_PATH": 
> "/home/aaron/Code/src/mesos/build/src/.libs"}'
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0605 14:58:23.748180 10710 main.cpp:323] Build: 2017-06-02 17:09:05 UTC by 
> aaron
> I0605 14:58:23.748252 10710 main.cpp:324] Version: 1.4.0
> I0605 14:58:23.755409 10710 systemd.cpp:238] systemd version `232` detected
> I0605 14:58:23.755450 10710 main.cpp:433] Initializing systemd state
> I0605 14:58:23.763049 10710 systemd.cpp:326] Started systemd slice 
> `mesos_executors.slice`
> I0605 14:58:23.763777 10710 resolver.cpp:69] Creating default secret resolver
> I0605 14:58:23.764214 10710 containerizer.cpp:230] Using isolation: 
> cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,volume/image,environment_secret
> I0605 14:58:23.767192 10710 linux_launcher.cpp:150] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> E0605 14:58:23.770179 10710 shell.hpp:107] Command 'hadoop version 2>&1' 
> failed; this is the output:
> sh: 1: hadoop: not found
> I0605 14:58:23.770217 10710 fetcher.cpp:69] Skipping URI fetcher plugin 
> 'hadoop' as it could not be created: Failed to create HDFS client: Failed to 
> execute 'hadoop version 2>&1'; the command was either not found or exited 
> with a non-zero exit status: 127
> I0605 14:58:23.770643 10710 provisioner.cpp:255] Using default backend 
> 'overlay'
> I0605 14:58:23.785892 10710 slave.cpp:248] Mesos agent started on 
> (1)@127.0.1.1:5051
> I0605 14:58:23.785957 10710 slave.cpp:249] Flags at startup: 
> --appc_simple_discovery_uri_prefix="http://" 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" 
> --docker="docker" --docker_kill_orphans="true" 
> --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" 
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
> --docker_store_dir="/tmp/mesos/store/docker" 
> --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_environment_variables="{"LD_LIBRARY_PATH":"\/home\/aaron\/Code\/src\/mesos\/build\/src\/.libs"}"
>  --executor_registration_timeout="1mins" 
> --executor_reregistration_timeout="2secs" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" 
> --frameworks_home="" 

[jira] [Commented] (MESOS-9420) Limit the number of tasks can be run on a host

2018-11-27 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701351#comment-16701351
 ] 

Joseph Wu commented on MESOS-9420:
--

Perhaps you want to limit the number of +containers+ or executors, rather than 
the number of tasks.  Tasks in Mesos can appear in a variety of forms, like 
processes, nested containers, or even threads.  An executor is the closest 
equivalent of a "pod" in Kubernetes.

A few possible questions:
1) Would you want the limit to apply equally to containers launched under the 
Mesos vs Docker containerizers?  i.e. a shared total, or an individual total?
2) What would you want to do if the agent ends up under-utilized after reaching 
the max number of tasks/executors/pods?  It is possible to launch a large 
number of tiny executors.
3) How would you want to handle reserved resources (such as persistent volumes) 
on nodes that have reached the maximum?  

> Limit the number of tasks can be run on a host
> --
>
> Key: MESOS-9420
> URL: https://issues.apache.org/jira/browse/MESOS-9420
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: haoyuan ge
>Priority: Minor
>
> Operators may want to limit the number of tasks running on a single host. 
> Like _kubelet --max-pods_ which can limit the total pods running on a host. 
> Can we make mesos-agent limit the task number also?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7564) Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.

2018-11-27 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701112#comment-16701112
 ] 

Joseph Wu commented on MESOS-7564:
--

Here are the proposed protobuf changes: https://reviews.apache.org/r/69463/

> Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.
> -
>
> Key: MESOS-7564
> URL: https://issues.apache.org/jira/browse/MESOS-7564
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, executor
>Reporter: Anand Mazumdar
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: api, mesosphere, v1_api
>
> Currently, we do not have heartbeats for executor <-> agent communication. 
> This is especially problematic in scenarios when IPFilters are enabled since 
> the default conntrack keep alive timeout is 5 days. When that timeout 
> elapses, the executor doesn't get notified via a socket disconnection when 
> the agent process restarts. The executor would then get killed if it doesn't 
> re-register when the agent recovery process is completed.
> Enabling application level heartbeats or TCP KeepAlive's can be a possible 
> way for fixing this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7564) Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.

2018-11-27 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701042#comment-16701042
 ] 

Joseph Wu commented on MESOS-7564:
--

I guess I'll summarize a bit of the discussion that happened in the API WG.

The current plan is to add some regular traffic to any persistent connections 
between agent and executor, so that the connection does not get marked "stale". 
 We want to make a minimal change first, to maintain backwards compatibility 
between new/old agents and new/old executors.  Since there are two persistent 
connections, we want to add Heartbeat Events from Agent to Executor, and 
Heartbeat Calls from Executor to Agent.  Neither agent nor executor will expect 
heartbeats (i.e. they won't disconnect if heartbeats don't appear).  
Unfortunately, in the case of old agents/executors, when they receive an 
unknown Call/Event, they will log a warning.
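
As a rough illustration of that backwards-compatible behavior (hypothetical 
types only; the actual protobuf and library changes are in the linked review):
{code}
#include <iostream>

// Hypothetical event-handling shape: a HEARTBEAT event needs no action, and
// an unrecognized event is logged and ignored rather than treated as a
// protocol error, so old and new agents/executors can interoperate.
enum class EventType { SUBSCRIBED, LAUNCH, KILL, ACKNOWLEDGED, MESSAGE,
                       SHUTDOWN, ERROR, HEARTBEAT, UNKNOWN };

void handleEvent(EventType type)
{
  switch (type) {
    case EventType::HEARTBEAT:
      // The traffic itself is the point; nothing else to do.
      break;
    case EventType::UNKNOWN:
      // An old executor receiving a new event type warns and carries on.
      std::cerr << "Ignoring unknown executor event" << std::endl;
      break;
    default:
      // Dispatch to the normal handlers (elided).
      break;
  }
}
{code}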

> Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.
> -
>
> Key: MESOS-7564
> URL: https://issues.apache.org/jira/browse/MESOS-7564
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, executor
>Reporter: Anand Mazumdar
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: api, mesosphere, v1_api
>
> Currently, we do not have heartbeats for executor <-> agent communication. 
> This is especially problematic in scenarios when IPFilters are enabled since 
> the default conntrack keep alive timeout is 5 days. When that timeout 
> elapses, the executor doesn't get notified via a socket disconnection when 
> the agent process restarts. The executor would then get killed if it doesn't 
> re-register when the agent recovery process is completed.
> Enabling application level heartbeats or TCP KeepAlive's can be a possible 
> way for fixing this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7564) Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.

2018-11-14 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687389#comment-16687389
 ] 

Joseph Wu commented on MESOS-7564:
--

Historically, we've considered the agent<->executor connection to be reliable.  
This is evident when you look at the agent's lack of handling for executor 
disconnections.  Currently, if an HTTP executor successfully registers, and 
then closes its connection, the agent will consider the executor "RUNNING".  
The agent will then merrily send all sorts of messages over the broken 
connection (and onto the floor), including LaunchTask messages.  The agent 
might log warnings, but it does not attempt to reconnect (it can't).  (The PID 
executor does not have this problem, because libprocess will make transient 
connections to send messages if the persistent connection breaks.)

If we are considering the agent<->executor connection to be unreliable, we 
first need to add/test logic to handle executor disconnections.  I believe it 
may be sufficient to detect (even belatedly) disconnections on the agent, and 
transition the agent's view of the executor from RUNNING to REGISTERING and 
start the registration timeout.  This would only be necessary for HTTP 
executors.

-

Next, to handle cases where the connection is "connected" but dropping 
packets: we will probably want to add heartbeats in both directions.

Just on the HTTP executor library, we have two connections to consider:
1) The SUBSCRIBE Call is one persistent connection where the executor sends one 
Call, and receives a stream of Events.  There is currently no Executor->Agent 
traffic except the first request.  This connection could probably use 
heartbeating in both directions.  Agent->Executor heartbeats may come in the 
form of Events.  Executor->Agent heartbeats will need to be something else 
(like the heartbeating suggested here: https://reviews.apache.org/r/69183/ ).

2) Other calls go through a secondary connection.  This persistent connection 
is used to send any number of Calls and their subsequent responses (202 
Accepted) back.  When the executor discovers a disconnection here, it remakes 
both connections.  This connection does not need heartbeating or monitoring.


> Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.
> -
>
> Key: MESOS-7564
> URL: https://issues.apache.org/jira/browse/MESOS-7564
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, executor
>Reporter: Anand Mazumdar
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: api, mesosphere, v1_api
>
> Currently, we do not have heartbeats for executor <-> agent communication. 
> This is especially problematic in scenarios when IPFilters are enabled since 
> the default conntrack keep alive timeout is 5 days. When that timeout 
> elapses, the executor doesn't get notified via a socket disconnection when 
> the agent process restarts. The executor would then get killed if it doesn't 
> re-register when the agent recovery process is completed.
> Enabling application level heartbeats or TCP KeepAlive's can be a possible 
> way for fixing this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9258) Consider making Mesos subscribers send heartbeats

2018-11-09 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682078#comment-16682078
 ] 

Joseph Wu commented on MESOS-9258:
--

Alternative proposal for bounding the max number of subscribers:
https://reviews.apache.org/r/69307/

This one requires almost no client-side changes (as long as clients already 
retry when disconnected) and the code changes are also somewhat minimal from a 
backporting perspective.

> Consider making Mesos subscribers send heartbeats
> -
>
> Key: MESOS-9258
> URL: https://issues.apache.org/jira/browse/MESOS-9258
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Gastón Kleiman
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: mesosphere
>
> Some reverse proxies (e.g., ELB using an HTTP listener) won't close the 
> upstream connection to Mesos when they detect that their client is 
> disconnected.
> This can make Mesos leak subscribers, which generates unnecessary 
> authorization requests and affects performance.
> We should evaluate methods (e.g., heartbeats) to enable Mesos to detect that 
> a subscriber is gone, even if the TCP connection is still open.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9258) Consider making Mesos subscribers send heartbeats

2018-11-08 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680729#comment-16680729
 ] 

Joseph Wu commented on MESOS-9258:
--

Prototype for the max lifetime proposal:
https://reviews.apache.org/r/69302/

> Consider making Mesos subscribers send heartbeats
> -
>
> Key: MESOS-9258
> URL: https://issues.apache.org/jira/browse/MESOS-9258
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Gastón Kleiman
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: mesosphere
>
> Some reverse proxies (e.g., ELB using an HTTP listener) won't close the 
> upstream connection to Mesos when they detect that their client is 
> disconnected.
> This can make Mesos leak subscribers, which generates unnecessary 
> authorization requests and affects performance.
> We should evaluate methods (e.g., heartbeats) to enable Mesos to detect that 
> a subscriber is gone, even if the TCP connection is still open.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-7564) Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.

2018-11-08 Thread Joseph Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-7564:


Assignee: Joseph Wu

> Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.
> -
>
> Key: MESOS-7564
> URL: https://issues.apache.org/jira/browse/MESOS-7564
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, executor
>Reporter: Anand Mazumdar
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: api, mesosphere, v1_api
>
> Currently, we do not have heartbeats for executor <-> agent communication. 
> This is especially problematic in scenarios when IPFilters are enabled since 
> the default conntrack keep alive timeout is 5 days. When that timeout 
> elapses, the executor doesn't get notified via a socket disconnection when 
> the agent process restarts. The executor would then get killed if it doesn't 
> re-register when the agent recovery process is completed.
> Enabling application level heartbeats or TCP KeepAlive's can be a possible 
> way for fixing this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-6949) SchedulerTest.MasterFailover is flaky

2018-11-07 Thread Joseph Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-6949:


Assignee: Joseph Wu  (was: Alexander Rukletsov)

> SchedulerTest.MasterFailover is flaky
> -
>
> Key: MESOS-6949
> URL: https://issues.apache.org/jira/browse/MESOS-6949
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Observed on:
> CentOS 7 VM, libevent and SSL enabled;
> Ubuntu 14.04, cmake/clang, without libevent/SSL, on ASF CI
>Reporter: Greg Mann
>Assignee: Joseph Wu
>Priority: Major
>  Labels: flaky-test, tests
> Fix For: 1.5.0
>
> Attachments: MasterFailover-badrun.txt, 
> SchedulerTest.MasterFailover-on-ASF-CI.txt, SchedulerTest.MasterFailover.txt, 
> SchedulerTest_MasterFailover_1_badrun.txt
>
>
> This was observed in a CentOS 7 VM, with libevent and SSL enabled:
> {code}
> W0118 22:38:33.789465  3407 scheduler.cpp:513] Dropping SUBSCRIBE: Scheduler 
> is in state DISCONNECTED
> I0118 22:38:33.811820  3408 scheduler.cpp:361] Connected with the master at 
> http://127.0.0.1:43211/master/api/v1/scheduler
> ../../src/tests/scheduler_tests.cpp:315: Failure
> Mock function called more times than expected - returning directly.
> Function call: connected(0x7fff97227550)
>  Expected: to be called once
>Actual: called twice - over-saturated and active
> {code}
> Find attached the entire log from a failed run.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8723) ROOT_HealthCheckUsingPersistentVolume is flaky.

2018-11-06 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677515#comment-16677515
 ] 

Joseph Wu commented on MESOS-8723:
--

Another bad run, on the 1.6.x branch (Ubuntu 16)
{code}
[ RUN  ] 
LauncherAndIsolationParam/PersistentVolumeDefaultExecutor.ROOT_HealthCheckUsingPersistentVolume/1
I1106 20:15:34.354775 32499 cluster.cpp:172] Creating default 'local' authorizer
I1106 20:15:34.355837 22262 master.cpp:463] Master 
ee3a72ac-f1ea-4572-ab7a-424ecc6e517c (ip-172-16-10-158.ec2.internal) started on 
172.16.10.158:46488
I1106 20:15:34.355865 22262 master.cpp:466] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/yQNdFw/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/yQNdFw/master" --zk_session_timeout="10secs"
I1106 20:15:34.356046 22262 master.cpp:515] Master only allowing authenticated 
frameworks to register
I1106 20:15:34.356058 22262 master.cpp:521] Master only allowing authenticated 
agents to register
I1106 20:15:34.356146 22262 master.cpp:527] Master only allowing authenticated 
HTTP frameworks to register
I1106 20:15:34.356154 22262 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/yQNdFw/credentials'
I1106 20:15:34.356290 22262 master.cpp:571] Using default 'crammd5' 
authenticator
I1106 20:15:34.356390 22262 http.cpp:959] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I1106 20:15:34.356514 22262 http.cpp:959] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I1106 20:15:34.356560 22262 http.cpp:959] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I1106 20:15:34.356707 22262 master.cpp:652] Authorization enabled
I1106 20:15:34.356874 22263 hierarchical.cpp:177] Initialized hierarchical 
allocator process
I1106 20:15:34.356999 22263 whitelist_watcher.cpp:77] No whitelist given
I1106 20:15:34.357659 22263 master.cpp:2162] Elected as the leading master!
I1106 20:15:34.357681 22263 master.cpp:1717] Recovering from registrar
I1106 20:15:34.357723 22263 registrar.cpp:339] Recovering registrar
I1106 20:15:34.357962 22257 registrar.cpp:383] Successfully fetched the 
registry (0B) in 184832ns
I1106 20:15:34.358003 22257 registrar.cpp:487] Applied 1 operations in 7418ns; 
attempting to update the registry
I1106 20:15:34.358218 22262 registrar.cpp:544] Successfully updated the 
registry in 129792ns
I1106 20:15:34.358259 22262 registrar.cpp:416] Successfully recovered registrar
I1106 20:15:34.358475 22262 master.cpp:1831] Recovered 0 agents from the 
registry (176B); allowing 10mins for agents to reregister
I1106 20:15:34.358522 22262 hierarchical.cpp:215] Skipping recovery of 
hierarchical allocator: nothing to recover
I1106 20:15:34.359434 32499 containerizer.cpp:296] Using isolation { 
environment_secret, network/cni, filesystem/posix, volume/sandbox_path }
I1106 20:15:34.361624 32499 linux_launcher.cpp:147] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I1106 20:15:34.362061 32499 provisioner.cpp:299] Using default backend 'overlay'
W1106 20:15:34.363476 32499 process.cpp:2829] Attempted to spawn already 
running process files@172.16.10.158:46488
I1106 20:15:34.363585 32499 cluster.cpp:460] Creating default 'local' authorizer
I1106 20:15:34.364131 22257 slave.cpp:265] Mesos agent started on 
(1090)@172.16.10.158:46488
I1106 20:15:34.364296 22257 slave.cpp:266] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://" 
--appc_store_dir="/tmp/LauncherAndIsolationParam_PersistentVolumeDefaultExecutor_ROOT_HealthCheckUsingPersistentVolume_1_9fXuUo/store/appc"
 

[jira] [Commented] (MESOS-9285) DockerVolumeIsolatorTest.ROOT_INTERNET_CURL_CommandTaskRootfsWithAbsolutePathVolume is flaky

2018-11-06 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677505#comment-16677505
 ] 

Joseph Wu commented on MESOS-9285:
--

Observed on another internal CI run, albeit on CentOS7 and on the 1.5.x branch 
(https://github.com/apache/mesos/tree/6008868c715733b7d798279e9b39ae3483f7d955)
{code}
[ RUN  ] 
DockerVolumeIsolatorTest.ROOT_INTERNET_CURL_CommandTaskRootfsWithAbsolutePathVolume
I1106 20:21:29.367887 31384 cluster.cpp:172] Creating default 'local' authorizer
I1106 20:21:29.368988 21440 master.cpp:457] Master 
0d219c14-565e-46dd-b5c2-56bd9e97e4d1 (ip-172-16-10-72.ec2.internal) started on 
172.16.10.72:46670
I1106 20:21:29.369009 21440 master.cpp:459] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/zLJ6wA/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/zLJ6wA/master" --zk_session_timeout="10secs"
I1106 20:21:29.369148 21440 master.cpp:508] Master only allowing authenticated 
frameworks to register
I1106 20:21:29.369156 21440 master.cpp:514] Master only allowing authenticated 
agents to register
I1106 20:21:29.369163 21440 master.cpp:520] Master only allowing authenticated 
HTTP frameworks to register
I1106 20:21:29.369168 21440 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/zLJ6wA/credentials'
I1106 20:21:29.369271 21440 master.cpp:564] Using default 'crammd5' 
authenticator
I1106 20:21:29.369320 21440 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I1106 20:21:29.369357 21440 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I1106 20:21:29.369383 21440 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I1106 20:21:29.369403 21440 master.cpp:643] Authorization enabled
I1106 20:21:29.369576 21438 hierarchical.cpp:177] Initialized hierarchical 
allocator process
I1106 20:21:29.369608 21438 whitelist_watcher.cpp:77] No whitelist given
I1106 20:21:29.370110 21440 master.cpp:2247] Elected as the leading master!
I1106 20:21:29.370126 21440 master.cpp:1727] Recovering from registrar
I1106 20:21:29.370163 21440 registrar.cpp:347] Recovering registrar
I1106 20:21:29.370302 21437 registrar.cpp:391] Successfully fetched the 
registry (0B) in 114944ns
I1106 20:21:29.370345 21437 registrar.cpp:495] Applied 1 operations in 6941ns; 
attempting to update the registry
I1106 20:21:29.370510 21442 registrar.cpp:552] Successfully updated the 
registry in 143104ns
I1106 20:21:29.370553 21442 registrar.cpp:424] Successfully recovered registrar
I1106 20:21:29.370631 21442 master.cpp:1840] Recovered 0 agents from the 
registry (172B); allowing 10mins for agents to re-register
I1106 20:21:29.370667 21438 hierarchical.cpp:215] Skipping recovery of 
hierarchical allocator: nothing to recover
I1106 20:21:29.372022 31384 isolator.cpp:136] Initialized the docker volume 
information root directory at '/run/mesos/isolators/docker/volume'
I1106 20:21:29.379148 31384 linux_launcher.cpp:145] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
sh: hadoop: command not found
I1106 20:21:29.462009 31384 fetcher.cpp:69] Skipping URI fetcher plugin 
'hadoop' as it could not be created: Failed to create HDFS client: Hadoop 
client is not available, exit status: 32512
I1106 20:21:29.462098 31384 registry_puller.cpp:129] Creating registry puller 
with docker registry 'https://registry-1.docker.io'
I1106 20:21:29.462872 31384 provisioner.cpp:299] Using default backend 'copy'
W1106 20:21:29.464048 31384 process.cpp:2745] Attempted to spawn already 
running process files@172.16.10.72:46670
I1106 

[jira] [Commented] (MESOS-6949) SchedulerTest.MasterFailover is flaky

2018-11-06 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677498#comment-16677498
 ] 

Joseph Wu commented on MESOS-6949:
--

Another fix: https://reviews.apache.org/r/69267/

> SchedulerTest.MasterFailover is flaky
> -
>
> Key: MESOS-6949
> URL: https://issues.apache.org/jira/browse/MESOS-6949
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Observed on:
> CentOS 7 VM, libevent and SSL enabled;
> Ubuntu 14.04, cmake/clang, without libevent/SSL, on ASF CI
>Reporter: Greg Mann
>Assignee: Alexander Rukletsov
>Priority: Major
>  Labels: flaky-test, tests
> Fix For: 1.5.0
>
> Attachments: MasterFailover-badrun.txt, 
> SchedulerTest.MasterFailover-on-ASF-CI.txt, SchedulerTest.MasterFailover.txt, 
> SchedulerTest_MasterFailover_1_badrun.txt
>
>
> This was observed in a CentOS 7 VM, with libevent and SSL enabled:
> {code}
> W0118 22:38:33.789465  3407 scheduler.cpp:513] Dropping SUBSCRIBE: Scheduler 
> is in state DISCONNECTED
> I0118 22:38:33.811820  3408 scheduler.cpp:361] Connected with the master at 
> http://127.0.0.1:43211/master/api/v1/scheduler
> ../../src/tests/scheduler_tests.cpp:315: Failure
> Mock function called more times than expected - returning directly.
> Function call: connected(0x7fff97227550)
>  Expected: to be called once
>Actual: called twice - over-saturated and active
> {code}
> Find attached the entire log from a failed run.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky

2018-11-06 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677306#comment-16677306
 ] 

Joseph Wu commented on MESOS-7971:
--

Slightly different logs were observed on an internal CI run (Ubuntu 16, no SSL).  
One HTTP request in this run expected a 202 response, but got a 409 instead.
{code}
[ RUN  ] PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove
I1106 19:50:14.650254 19563 cluster.cpp:162] Creating default 'local' authorizer
I1106 19:50:14.651284 19588 master.cpp:442] Master 
d5905469-73fc-4219-b939-c6056f1f62a1 (ip-172-16-10-48.ec2.internal) started on 
172.16.10.48:39946
I1106 19:50:14.651309 19588 master.cpp:444] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="50ms" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/fZatVl/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--role_sorter="drf" --roles="role1" --root_submissions="true" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/fZatVl/master" 
--zk_session_timeout="10secs"
I1106 19:50:14.651437 19588 master.cpp:494] Master only allowing authenticated 
frameworks to register
I1106 19:50:14.651448 19588 master.cpp:508] Master only allowing authenticated 
agents to register
I1106 19:50:14.651453 19588 master.cpp:521] Master only allowing authenticated 
HTTP frameworks to register
I1106 19:50:14.651459 19588 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/fZatVl/credentials'
I1106 19:50:14.651548 19588 master.cpp:566] Using default 'crammd5' 
authenticator
I1106 19:50:14.651593 19588 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I1106 19:50:14.651643 19588 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I1106 19:50:14.651672 19588 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I1106 19:50:14.651700 19588 master.cpp:646] Authorization enabled
W1106 19:50:14.651710 19588 master.cpp:709] The '--roles' flag is deprecated. 
This flag will be removed in the future. See the Mesos 0.27 upgrade notes for 
more information
I1106 19:50:14.651803 19590 hierarchical.cpp:173] Initialized hierarchical 
allocator process
I1106 19:50:14.651830 19590 whitelist_watcher.cpp:77] No whitelist given
I1106 19:50:14.652432 19590 master.cpp:2200] Elected as the leading master!
I1106 19:50:14.652454 19590 master.cpp:1739] Recovering from registrar
I1106 19:50:14.652506 19590 registrar.cpp:347] Recovering registrar
I1106 19:50:14.652595 19590 registrar.cpp:391] Successfully fetched the 
registry (0B) in 72960ns
I1106 19:50:14.652622 19590 registrar.cpp:495] Applied 1 operations in 5332ns; 
attempting to update the registry
I1106 19:50:14.656131 19586 registrar.cpp:552] Successfully updated the 
registry in 3.472128ms
I1106 19:50:14.656177 19586 registrar.cpp:424] Successfully recovered registrar
I1106 19:50:14.656266 19588 master.cpp:1838] Recovered 0 agents from the 
registry (168B); allowing 10mins for agents to re-register
I1106 19:50:14.656299 19588 hierarchical.cpp:211] Skipping recovery of 
hierarchical allocator: nothing to recover
W1106 19:50:14.657806 19563 process.cpp:3196] Attempted to spawn already 
running process files@172.16.10.48:39946
I1106 19:50:14.658203 19563 containerizer.cpp:246] Using isolation: 
posix/cpu,posix/mem,filesystem/posix,network/cni,environment_secret
I1106 19:50:14.661717 19563 linux_launcher.cpp:149] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I1106 19:50:14.662039 19563 provisioner.cpp:255] Using default backend 'overlay'
I1106 19:50:14.662547 19563 cluster.cpp:448] Creating default 'local' authorizer
I1106 19:50:14.662969 19589 slave.cpp:249] Mesos agent started on 
(378)@172.16.10.48:39946
I1106 19:50:14.662987 19589 slave.cpp:250] Flags at startup: --acls="" 

[jira] [Commented] (MESOS-9258) Consider making Mesos subscribers send heartbeats

2018-11-02 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673558#comment-16673558
 ] 

Joseph Wu commented on MESOS-9258:
--

After some more investigation, requiring two-way streaming will not work for 
browsers (i.e. the WebUI), because two-way streaming requires websockets, and 
the load balancers that do not close connections (e.g. Elastic LB) do not 
support websockets.

Now, we are considering two other workarounds:
1) Creating a separate {{HEARTBEAT}} API call and having {{SUBSCRIBE}} 
return a stream ID.  This has the downside of requiring (sometimes) significant 
client-side changes, as clients would need to parse an additional message type, 
maintain state, and keep a separate thread for heartbeating (a minimal sketch 
follows this list).  This might also be harder to justify in a backport (if 
necessary).
2) Adding an optional field to the {{SUBSCRIBE}} call which lets the client set 
the maximum lifetime of a connection.  The master would unilaterally close the 
connection after the specified duration.  This change would require the client 
to have retry/reconnect logic (which would be expected anyway).
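
To make the client-side burden of option 1 concrete, here is a minimal, 
hypothetical sketch (plain C++, not Mesos code) of the extra state and thread a 
client would need to keep for heartbeating; {{sendHeartbeat}} and the stream ID 
handling are placeholders invented for illustration:
{code}
// Hypothetical client-side heartbeater for option 1: the client tracks the
// stream ID returned by SUBSCRIBE and runs a dedicated thread that
// periodically sends HEARTBEAT calls. `sendHeartbeat` stands in for whatever
// call the client would actually issue.
#include <atomic>
#include <chrono>
#include <functional>
#include <string>
#include <thread>
#include <utility>

class Heartbeater {
public:
  Heartbeater(std::string streamId,
              std::chrono::seconds interval,
              std::function<void(const std::string&)> sendHeartbeat)
    : streamId_(std::move(streamId)),
      running_(true),
      thread_([this, interval, sendHeartbeat] {
        while (running_) {
          // e.g. POST a HEARTBEAT call carrying the stream ID.
          sendHeartbeat(streamId_);
          std::this_thread::sleep_for(interval);
        }
      }) {}

  ~Heartbeater() {
    running_ = false;
    thread_.join();
  }

private:
  std::string streamId_;
  std::atomic<bool> running_;
  std::thread thread_;
};
{code}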

> Consider making Mesos subscribers send heartbeats
> -
>
> Key: MESOS-9258
> URL: https://issues.apache.org/jira/browse/MESOS-9258
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Gastón Kleiman
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: mesosphere
>
> Some reverse proxies (e.g., ELB using an HTTP listener) won't close the 
> upstream connection to Mesos when they detect that their client is 
> disconnected.
> This can make Mesos leak subscribers, which generates unnecessary 
> authorization requests and affects performance.
> We should evaluate methods (e.g., heartbeats) to enable Mesos to detect that 
> a subscriber is gone, even if the TCP connection is still open.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9357) FetcherTest.DuplicateFileURI fails on macos

2018-10-29 Thread Joseph Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-9357:


Assignee: Joseph Wu

> FetcherTest.DuplicateFileURI fails on macos
> ---
>
> Key: MESOS-9357
> URL: https://issues.apache.org/jira/browse/MESOS-9357
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Bannier
>Assignee: Joseph Wu
>Priority: Major
>  Labels: flaky, macOS
>
> I see {{FetcherTest.DuplicateFileURI}} fail pretty reliably on macos, e.g., 
> 10.14.
> {noformat}
> ../../src/tests/fetcher_tests.cpp:173
> Value of: os::exists("two")
>   Actual: false
> Expected: true
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9360) Cloud

2018-10-29 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667471#comment-16667471
 ] 

Joseph Wu commented on MESOS-9360:
--

[~shajanajumudeen87], could you expand on what this ticket is for?

> Cloud
> -
>
> Key: MESOS-9360
> URL: https://issues.apache.org/jira/browse/MESOS-9360
> Project: Mesos
>  Issue Type: Task
>  Components: allocation
>Affects Versions: 1.7.0
>Reporter: Haja Najumudeen
>Priority: Critical
> Fix For: 1.7.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8343) SchedulerHttpApiTest.UpdatePidToHttpScheduler is flaky.

2018-10-26 Thread Joseph Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-8343:


   Resolution: Fixed
 Assignee: Meng Zhu
Fix Version/s: 1.8.0

{code}
commit 15b78fa5f4bd0968ecf168c32e8b6bbb0e822688
Author: Meng Zhu 
Date:   Fri Oct 26 15:19:27 2018 -0700

Fixed flaky test `SchedulerHttpApiTest.UpdatePidToHttpScheduler`.

The test was flaky due to a race between scheduler driver stopping
during test teardown and the scheduler `error()` invocation.
Adding the missing synchronization for the expectation should
eliminate the race.

Review: https://reviews.apache.org/r/69176/
{code}
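
For readers unfamiliar with this class of fix, the following is a generic 
GoogleMock sketch (not the actual Mesos patch) of the synchronization pattern 
described in the commit message: the test blocks on a future that is satisfied 
inside the expectation, so teardown cannot race with the {{error()}} callback. 
The {{MockScheduler}} type is invented for illustration.
{code}
#include <future>
#include <string>

#include <gmock/gmock.h>
#include <gtest/gtest.h>

class MockScheduler {
public:
  MOCK_METHOD(void, error, (const std::string& message), ());
};

TEST(SchedulerSyncExample, WaitForErrorBeforeTeardown)
{
  MockScheduler sched;

  std::promise<void> errored;
  std::future<void> errorCalled = errored.get_future();

  // Satisfy the future inside the expectation so the test can wait on it.
  EXPECT_CALL(sched, error(::testing::_))
    .WillOnce(::testing::InvokeWithoutArgs([&]() { errored.set_value(); }));

  // ... code that eventually triggers the error path; simulated here ...
  sched.error("simulated failure");

  // Block until the expectation has fired, so teardown cannot race with it.
  errorCalled.wait();
}
{code}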

> SchedulerHttpApiTest.UpdatePidToHttpScheduler is flaky.
> ---
>
> Key: MESOS-8343
> URL: https://issues.apache.org/jira/browse/MESOS-8343
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.4.1
> Environment: Ubuntu 17.04 CMake
>Reporter: Alexander Rukletsov
>Assignee: Meng Zhu
>Priority: Major
>  Labels: flaky-test
> Fix For: 1.8.0
>
> Attachments: UpdatePidToHttpScheduler-badrun.txt
>
>
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-17.04/mesos/src/tests/scheduler_http_api_tests.cpp:504
> Actual function call count doesn't match EXPECT_CALL(sched, error(_, _))...
>  Expected: to be called once
>Actual: never called - unsatisfied and active
> {noformat}
> Full log attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-7974) Accept "application/recordio" type is rejected for master operator API SUBSCRIBE call

2018-10-25 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664483#comment-16664483
 ] 

Joseph Wu edited comment on MESOS-7974 at 10/26/18 12:58 AM:
-

The relevant review in the chain of MESOS-9258 that will fix this:
https://reviews.apache.org/r/69185/

(Note: Without this change in the chain, the master actor (after applying other 
patches) will crash upon receiving the streaming headers :D )


was (Author: kaysoky):
The relevant review in the chain of MESOS-9258 that will fix this:
https://reviews.apache.org/r/69185/

(Note: Without this change, the master actor will crash upon receiving the 
streaming headers :D )

> Accept "application/recordio" type is rejected for master operator API 
> SUBSCRIBE call
> -
>
> Key: MESOS-7974
> URL: https://issues.apache.org/jira/browse/MESOS-7974
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.1
>Reporter: James DeFelice
>Assignee: Joseph Wu
>Priority: Major
>  Labels: mesosphere
>
> The agent operator API supports "application/recordio" for things like 
> attach-container-output, which streams objects back to the caller. I expected 
> the master operator API SUBSCRIBE call to work the same way, w/ 
> Accept/Content-Type headers for "recordio" and 
> Message-Accept/Message-Content-Type headers for json (or protobuf). This was 
> not the case.
> Looking again at the master operator API documentation, SUBSCRIBE docs 
> illustrate usage of Accept and Content-Type headers for the "application/json" 
> type. Not a "recordio" type. So my experience, as per the docs, seems 
> expected. However, this is counter-intuitive since the whole point of adding 
> the new Message-prefixed headers was to help callers consistently request 
> (and differentiate) streaming responses from non-streaming responses in the 
> v1 API.
> Please fix the master operator API implementation to also support the 
> Message-prefixed headers w/ Accept/Content-Type set to "recordio".
> Observed on ubuntu w/ mesos package version 1.2.1-2.0.1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7974) Accept "application/recordio" type is rejected for master operator API SUBSCRIBE call

2018-10-25 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664483#comment-16664483
 ] 

Joseph Wu commented on MESOS-7974:
--

The relevant review in the chain of MESOS-9258 that will fix this:
https://reviews.apache.org/r/69185/

(Note: Without this change, the master actor will crash upon receiving the 
streaming headers :D )

> Accept "application/recordio" type is rejected for master operator API 
> SUBSCRIBE call
> -
>
> Key: MESOS-7974
> URL: https://issues.apache.org/jira/browse/MESOS-7974
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.1
>Reporter: James DeFelice
>Assignee: Joseph Wu
>Priority: Major
>  Labels: mesosphere
>
> The agent operator API supports "application/recordio" for things like 
> attach-container-output, which streams objects back to the caller. I expected 
> the master operator API SUBSCRIBE call to work the same way, w/ 
> Accept/Content-Type headers for "recordio" and 
> Message-Accept/Message-Content-Type headers for json (or protobuf). This was 
> not the case.
> Looking again at the master operator API documentation, SUBSCRIBE docs 
> illustrate usage of Accept and Content-Type headers for the "application/json" 
> type. Not a "recordio" type. So my experience, as per the docs, seems 
> expected. However, this is counter-intuitive since the whole point of adding 
> the new Message-prefixed headers was to help callers consistently request 
> (and differentiate) streaming responses from non-streaming responses in the 
> v1 API.
> Please fix the master operator API implementation to also support the 
> Message-prefixed headers w/ Accept/Content-Type set to "recordio".
> Observed on ubuntu w/ mesos package version 1.2.1-2.0.1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9258) Consider making Mesos subscribers send heartbeats

2018-10-25 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664480#comment-16664480
 ] 

Joseph Wu commented on MESOS-9258:
--

Still in progress, but a prototype is up for preliminary review starting here: 
https://reviews.apache.org/r/69180/

The idea is to let the {{master /api/v1 SUBSCRIBE}} call take a streaming 
request (optional) as well as a streaming response.  When the call is made via 
a streaming request, the same stream will be used to send heartbeats from 
client to master.
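
For context, the v1 streaming API frames each message in RecordIO: the record's 
length in bytes, a newline, then the record itself.  Below is a minimal encoder 
sketch (illustrative only, not the libprocess implementation) of what a client 
holding the long-lived request stream open would write onto it; the exact 
heartbeat message type is whatever the review chain ends up defining.
{code}
// Minimal RecordIO encoder sketch: each record is prefixed with its length in
// bytes followed by a newline. Illustrative only; not the libprocess code.
#include <iostream>
#include <sstream>
#include <string>

std::string encodeRecord(const std::string& record)
{
  std::ostringstream out;
  out << record.size() << "\n" << record;
  return out.str();
}

int main()
{
  // A client keeping the SUBSCRIBE request stream open could periodically
  // write an encoded heartbeat message (JSON or protobuf) onto that stream.
  std::cout << encodeRecord("{\"type\":\"HEARTBEAT\"}");  // prints "20\n" then the record
  return 0;
}
{code}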

> Consider making Mesos subscribers send heartbeats
> -
>
> Key: MESOS-9258
> URL: https://issues.apache.org/jira/browse/MESOS-9258
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Gastón Kleiman
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: mesosphere
>
> Some reverse proxies (e.g., ELB using an HTTP listener) won't close the 
> upstream connection to Mesos when they detect that their client is 
> disconnected.
> This can make Mesos leak subscribers, which generates unnecessary 
> authorization requests and affects performance.
> We should evaluate methods (e.g., heartbeats) to enable Mesos to detect that 
> a subscriber is gone, even if the TCP connection is still open.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-7974) Accept "application/recordio" type is rejected for master operator API SUBSCRIBE call

2018-10-24 Thread Joseph Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-7974:


Assignee: Joseph Wu

Assigning this to myself because a change (allowing streaming requests) I'm 
making in MESOS-9258 may resolve this issue as a side-effect.  We might change 
the implementation details during review though, so no guarantees.

> Accept "application/recordio" type is rejected for master operator API 
> SUBSCRIBE call
> -
>
> Key: MESOS-7974
> URL: https://issues.apache.org/jira/browse/MESOS-7974
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.1
>Reporter: James DeFelice
>Assignee: Joseph Wu
>Priority: Major
>  Labels: mesosphere
>
> The agent operator API supports "application/recordio" for things like 
> attach-container-output, which streams objects back to the caller. I expected 
> the master operator API SUBSCRIBE call to work the same way, w/ 
> Accept/Content-Type headers for "recordio" and 
> Message-Accept/Message-Content-Type headers for json (or protobuf). This was 
> not the case.
> Looking again at the master operator API documentation, SUBSCRIBE docs 
> illustrate usage of Accept and Content-Type headers for the "application/json" 
> type. Not a "recordio" type. So my experience, as per the docs, seems 
> expected. However, this is counter-intuitive since the whole point of adding 
> the new Message-prefixed headers was to help callers consistently request 
> (and differentiate) streaming responses from non-streaming responses in the 
> v1 API.
> Please fix the master operator API implementation to also support the 
> Message-prefixed headers w/ Accept/Content-Type set to "recordio".
> Observed on ubuntu w/ mesos package version 1.2.1-2.0.1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9309) Master Healthcheck Only Returns True

2018-10-23 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661061#comment-16661061
 ] 

Joseph Wu commented on MESOS-9309:
--

You are seeing that correctly.  Both the master and agent have a similar 
endpoint whose only return value is {{true}}.  These are purely meant to 
indicate the actors are running, with no information about their state.  
Metrics (i.e. {{/metrics/snapshot}}) expose more info about health.
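
As an aside, a client wanting more than the trivial {{/health}} signal could 
poll {{/metrics/snapshot}} instead.  A small illustrative sketch using libcurl 
follows; the host/port are assumptions for the example and not part of Mesos.
{code}
// Fetch /metrics/snapshot and print the raw JSON body; a leading master
// reports "master/elected": 1 in this document. Illustrative only.
#include <curl/curl.h>
#include <iostream>
#include <string>

static size_t appendToString(char* data, size_t size, size_t nmemb, void* userp)
{
  static_cast<std::string*>(userp)->append(data, size * nmemb);
  return size * nmemb;
}

int main()
{
  curl_global_init(CURL_GLOBAL_DEFAULT);
  CURL* curl = curl_easy_init();
  std::string body;

  curl_easy_setopt(curl, CURLOPT_URL, "http://127.0.0.1:5050/metrics/snapshot");
  curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, appendToString);
  curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);

  CURLcode res = curl_easy_perform(curl);
  if (res == CURLE_OK) {
    std::cout << body << std::endl;
  }

  curl_easy_cleanup(curl);
  curl_global_cleanup();
  return res == CURLE_OK ? 0 : 1;
}
{code}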

> Master Healthcheck Only Returns True
> 
>
> Key: MESOS-9309
> URL: https://issues.apache.org/jira/browse/MESOS-9309
> Project: Mesos
>  Issue Type: Bug
>Reporter: Gabriel Hartmann
>Priority: Major
>
> Unless I'm reading it wrong the current [Master health 
> check|https://github.com/apache/mesos/blob/master/src/master/http.cpp#L1651] 
> doesn't do anything apart from return true.
> A possible candidate for a non-trivial health check would be whether the 
> Master has an established ZK connection.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9258) Consider making Mesos subscribers send heartbeats

2018-09-28 Thread Joseph Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-9258:


Assignee: Joseph Wu

> Consider making Mesos subscribers send heartbeats
> -
>
> Key: MESOS-9258
> URL: https://issues.apache.org/jira/browse/MESOS-9258
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Gastón Kleiman
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: mesosphere
>
> Some reverse proxies (e.g., ELB using an HTTP listener) won't close the 
> upstream connection to Mesos when they detect that their client is 
> disconnected.
> This can make Mesos leak subscribers, which generates unnecessary 
> authorization requests and affects performance.
> We should evaluate methods (e.g., heartbeats) to enable Mesos to detect that 
> a subscriber is gone, even if the TCP connection is still open.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-3484) Master failed to shutdown: failed on fd: Transport endpoint is not connected.

2018-09-28 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632221#comment-16632221
 ] 

Joseph Wu commented on MESOS-3484:
--

[~jomach] As mentioned, this log line is a red herring.  If you have problems 
launching a task, you should consult the task logs (if present) or the agent 
logs.  Feel free to open a separate issue if those logs do not help you address 
the problem.

> Master failed to shutdown: failed on fd: Transport endpoint is not connected.
> -
>
> Key: MESOS-3484
> URL: https://issues.apache.org/jira/browse/MESOS-3484
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.24.0
>Reporter: Chi Zhang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9217) LongLivedDefaultExecutorRestart is flaky.

2018-09-28 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632200#comment-16632200
 ] 

Joseph Wu commented on MESOS-9217:
--

Temporarily disabled:
{code}
commit 698c1498c20585089db9fca98c35bbab8da46c04
Author: Joseph Wu 
Date:   Fri Sep 28 11:09:19 2018 -0700

Disabled flaky LongLivedDefaultExecutorRestart test.

This flaky test is tracked here:
https://issues.apache.org/jira/browse/MESOS-9217
{code}

> LongLivedDefaultExecutorRestart is flaky.
> -
>
> Key: MESOS-9217
> URL: https://issues.apache.org/jira/browse/MESOS-9217
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.7.0, 1.8.0
> Environment: Ubuntu 14.04, 16.04
>Reporter: Till Toenshoff
>Assignee: Joseph Wu
>Priority: Major
>  Labels: flaky, flaky-test, test
>
> {noformat}
> 03:52:07  [ RUN  ] 
> GarbageCollectorIntegrationTest.LongLivedDefaultExecutorRestart
> 03:52:07  I0907 03:52:07.699676  2350 cluster.cpp:173] Creating default 
> 'local' authorizer
> 03:52:07  I0907 03:52:07.700664  2374 master.cpp:413] Master 
> 8e9d97f6-4dc4-490b-81f6-d2033e2109d3 (ip-172-16-10-27.ec2.internal) started 
> on 172.16.10.27:45074
> 03:52:07  I0907 03:52:07.700690  2374 master.cpp:416] Flags at startup: 
> --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="hierarchical" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/cuUPYo/credentials" --filter_gpu_resources="true" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
> --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/cuUPYo/master" --zk_session_timeout="10secs"
> 03:52:07  I0907 03:52:07.700857  2374 master.cpp:465] Master only allowing 
> authenticated frameworks to register
> 03:52:07  I0907 03:52:07.700870  2374 master.cpp:471] Master only allowing 
> authenticated agents to register
> 03:52:07  I0907 03:52:07.700947  2374 master.cpp:477] Master only allowing 
> authenticated HTTP frameworks to register
> 03:52:07  I0907 03:52:07.700958  2374 credentials.hpp:37] Loading credentials 
> for authentication from '/tmp/cuUPYo/credentials'
> 03:52:07  I0907 03:52:07.701068  2374 master.cpp:521] Using default 'crammd5' 
> authenticator
> 03:52:07  I0907 03:52:07.701151  2374 http.cpp:1037] Creating default 'basic' 
> HTTP authenticator for realm 'mesos-master-readonly'
> 03:52:07  I0907 03:52:07.701254  2374 http.cpp:1037] Creating default 'basic' 
> HTTP authenticator for realm 'mesos-master-readwrite'
> 03:52:07  I0907 03:52:07.701352  2374 http.cpp:1037] Creating default 'basic' 
> HTTP authenticator for realm 'mesos-master-scheduler'
> 03:52:07  I0907 03:52:07.701445  2374 master.cpp:602] Authorization enabled
> 03:52:07  I0907 03:52:07.701566  2370 whitelist_watcher.cpp:77] No whitelist 
> given
> 03:52:07  I0907 03:52:07.701695  2376 hierarchical.cpp:182] Initialized 
> hierarchical allocator process
> 03:52:07  I0907 03:52:07.702237  2374 master.cpp:2083] Elected as the leading 
> master!
> 03:52:07  I0907 03:52:07.702255  2374 master.cpp:1638] Recovering from 
> registrar
> 03:52:07  I0907 03:52:07.702293  2375 registrar.cpp:339] Recovering registrar
> 03:52:07  I0907 03:52:07.706190  2375 registrar.cpp:383] Successfully fetched 
> the registry (0B) in 3.884032ms
> 03:52:07  I0907 03:52:07.706233  2375 registrar.cpp:487] Applied 1 operations 
> in 7967ns; attempting to update the registry
> 03:52:07  I0907 03:52:07.706378  2375 registrar.cpp:544] Successfully updated 
> the registry in 126976ns
> 03:52:07  I0907 03:52:07.706413  2375 registrar.cpp:416] Successfully 
> recovered registrar
> 03:52:07  I0907 03:52:07.706507  2375 master.cpp:1752] Recovered 0 agents 
> 

[jira] [Commented] (MESOS-7947) Add GC capability to nested containers

2018-09-05 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605066#comment-16605066
 ] 

Joseph Wu commented on MESOS-7947:
--

Also backported into 1.7.0.

> Add GC capability to nested containers
> --
>
> Key: MESOS-7947
> URL: https://issues.apache.org/jira/browse/MESOS-7947
> Project: Mesos
>  Issue Type: Improvement
>  Components: executor
>Reporter: Chun-Hung Hsiao
>Assignee: Joseph Wu
>Priority: Major
> Fix For: 1.7.0, 1.8.0
>
>
> We should extend the existing API or add a new API for nested containers for 
> an executor to tell the Mesos agent that a nested container is no longer 
> needed and can be scheduled for GC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8976) MasterTest.LaunchDuplicateOfferLost is flaky

2018-08-29 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596722#comment-16596722
 ] 

Joseph Wu edited comment on MESOS-8976 at 8/29/18 6:52 PM:
---

The {{src/test/utils.cpp:64}} helper that failed is:
{code}
JSON::Object Metrics()
{
  UPID upid("metrics", process::address());

  /* For some reason, this call times out and never completes. */
  Future<http::Response> response = http::get(upid, "snapshot");

  AWAIT_EXPECT_RESPONSE_STATUS_EQ(http::OK().status, response);
  AWAIT_EXPECT_RESPONSE_HEADER_EQ(APPLICATION_JSON, "Content-Type", response);

  /* The `response->body` below is basically an unguarded `Future::get` 
because the two AWAIT calls above are the EXPECT variety, meaning 
that they do not return from the function when they fail. */
  Try<JSON::Object> parse = JSON::parse(response->body);
  CHECK_SOME(parse);

  return parse.get();
}
{code}


was (Author: kaysoky):
The {{src/test/utils.cpp:64}} helper that failed is:
{code}
JSON::Object Metrics()
{
  UPID upid("metrics", process::address());

  /* For some reason, this call times out and never completes. */
  Future<http::Response> response = http::get(upid, "snapshot");

  AWAIT_EXPECT_RESPONSE_STATUS_EQ(http::OK().status, response);
  AWAIT_EXPECT_RESPONSE_HEADER_EQ(APPLICATION_JSON, "Content-Type", response);

  /* The `response->body` below is basically an unguarded `Future::get` because 
the
two AWAIT calls above are the EXPECT variety, meaning that they do not 
return
from the function when they fail. */
  Try<JSON::Object> parse = JSON::parse(response->body);
  CHECK_SOME(parse);

  return parse.get();
}
{code}

> MasterTest.LaunchDuplicateOfferLost is flaky
> 
>
> Key: MESOS-8976
> URL: https://issues.apache.org/jira/browse/MESOS-8976
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: flaky-test
> Attachments: LaunchDuplicateOfferLost.jenkins-faillog
>
>
> In an internal CI run, we observed a failure with this test where the 
> scheduler seemed to be stuck repeatedly allocating resources to the agent for 
> about 1 hour before getting timed out. See attached log for details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8976) MasterTest.LaunchDuplicateOfferLost is flaky

2018-08-29 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596722#comment-16596722
 ] 

Joseph Wu commented on MESOS-8976:
--

The {{src/test/utils.cpp:64}} helper that failed is:
{code}
JSON::Object Metrics()
{
  UPID upid("metrics", process::address());

  /* For some reason, this call times out and never completes. */
  Future<http::Response> response = http::get(upid, "snapshot");

  AWAIT_EXPECT_RESPONSE_STATUS_EQ(http::OK().status, response);
  AWAIT_EXPECT_RESPONSE_HEADER_EQ(APPLICATION_JSON, "Content-Type", response);

  /* The `response->body` below is basically an unguarded `Future::get` because 
the
two AWAIT calls above are the EXPECT variety, meaning that they do not 
return
from the function when they fail. */
  Try<JSON::Object> parse = JSON::parse(response->body);
  CHECK_SOME(parse);

  return parse.get();
}
{code}

> MasterTest.LaunchDuplicateOfferLost is flaky
> 
>
> Key: MESOS-8976
> URL: https://issues.apache.org/jira/browse/MESOS-8976
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: flaky-test
> Attachments: LaunchDuplicateOfferLost.jenkins-faillog
>
>
> In an internal CI run, we observed a failure with this test where the 
> scheduler seemed to be stuck repeatedly allocating resources to the agent for 
> about 1 hour before getting timed out. See attached log for details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7386) Executor not cleaning up existing running docker containers if external logrotate/logger processes die/killed

2018-08-27 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594059#comment-16594059
 ] 

Joseph Wu commented on MESOS-7386:
--

I believe the problem still exists.  When the {{mesos-docker-executor}} exits 
prematurely for any reason (like someone manually killing the executor), it 
will not have the chance to stop the associated docker container.

> Executor not cleaning up existing running docker containers if external 
> logrotate/logger processes die/killed
> -
>
> Key: MESOS-7386
> URL: https://issues.apache.org/jira/browse/MESOS-7386
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, docker, executor
>Affects Versions: 0.28.2, 1.2.0
> Environment: Mesos 0.28.2/1.2.0, docker 1.12.0/17.04.0-ce, marathon 
> v1.1.2/v1.4.2 , ubuntu trusty 14.04, 
> org_apache_mesos_LogrotateContainerLogger, 
> org_apache_mesos_ExternalContainerLogger
>Reporter: Pranay Kanwar
>Priority: Critical
>
> if mesos-logrorate/external logger processes die/killed executor exits / task 
> fails and task is relaunched , but is unable to cleanup existing running 
> container.
> Logs 
> {noformat}
> slave-one_1  | I0413 12:45:17.707762  8989 status_update_manager.cpp:395] 
> Received status update acknowledgement (UUID: 
> 7262c443-e201-45f4-8de0-825d3d92c26b) for task 
> msg.dfb155bc-2046-11e7-8019-02427fa1c4d5 of framework 
> d1d616b4-1ed1-4fed-92e5-0ee3d8619be9-
> slave-one_1  | I0413 12:45:17.707813  8989 status_update_manager.cpp:832] 
> Checkpointing ACK for status update TASK_FAILED (UUID: 
> 7262c443-e201-45f4-8de0-825d3d92c26b) for task 
> msg.dfb155bc-2046-11e7-8019-02427fa1c4d5 of framework 
> d1d616b4-1ed1-4fed-92e5-0ee3d8619be9-
> slave-one_1  | I0413 12:45:18.615839  8991 slave.cpp:4388] Got exited event 
> for executor(1)@172.17.0.1:36471
> slave-one_1  | I0413 12:45:18.696413  8987 docker.cpp:2358] Executor for 
> container 665e86c8-ef36-4be3-b56e-3ba7edc81182 has exited
> slave-one_1  | I0413 12:45:18.696446  8987 docker.cpp:2052] Destroying 
> container 665e86c8-ef36-4be3-b56e-3ba7edc81182
> slave-one_1  | I0413 12:45:18.696482  8987 docker.cpp:2179] Running docker 
> stop on container 665e86c8-ef36-4be3-b56e-3ba7edc81182
> slave-one_1  | I0413 12:45:18.697042  8994 slave.cpp:4769] Executor 
> 'msg.dfb155bc-2046-11e7-8019-02427fa1c4d5' of framework 
> d1d616b4-1ed1-4fed-92e5-0ee3d8619be9- exited with status 0
> slave-one_1  | I0413 12:45:18.697077  8994 slave.cpp:4869] Cleaning up 
> executor 'msg.dfb155bc-2046-11e7-8019-02427fa1c4d5' of framework 
> d1d616b4-1ed1-4fed-92e5-0ee3d8619be9- at executor(1)@172.17.0.1:36471
> slave-one_1  | I0413 12:45:18.697424  8994 slave.cpp:4957] Cleaning up 
> framework d1d616b4-1ed1-4fed-92e5-0ee3d8619be9-
> slave-one_1  | I0413 12:45:18.697530  8994 gc.cpp:55] Scheduling 
> '/tmp/mesos/agent/slaves/d1d616b4-1ed1-4fed-92e5-0ee3d8619be9-S0/frameworks/d1d616b4-1ed1-4fed-92e5-0ee3d8619be9-/executors/msg.dfb155bc-2046-11e7-8019-02427fa1c4d5/runs/665e86c8-ef36-4be3-b56e-3ba7edc81182'
>  for gc 6.9192952593days in the future
> slave-one_1  | I0413 12:45:18.697572  8994 gc.cpp:55] Scheduling 
> '/tmp/mesos/agent/slaves/d1d616b4-1ed1-4fed-92e5-0ee3d8619be9-S0/frameworks/d1d616b4-1ed1-4fed-92e5-0ee3d8619be9-/executors/msg.dfb155bc-2046-11e7-8019-02427fa1c4d5'
>  for gc 6.9192882963days in the future
> slave-one_1  | I0413 12:45:18.697607  8994 gc.cpp:55] Scheduling 
> '/tmp/mesos/agent/meta/slaves/d1d616b4-1ed1-4fed-92e5-0ee3d8619be9-S0/frameworks/d1d616b4-1ed1-4fed-92e5-0ee3d8619be9-/executors/msg.dfb155bc-2046-11e7-8019-02427fa1c4d5/runs/665e86c8-ef36-4be3-b56e-3ba7edc81182'
>  for gc 6.9192843852days in the future
> slave-one_1  | I0413 12:45:18.697628  8994 gc.cpp:55] Scheduling 
> '/tmp/mesos/agent/meta/slaves/d1d616b4-1ed1-4fed-92e5-0ee3d8619be9-S0/frameworks/d1d616b4-1ed1-4fed-92e5-0ee3d8619be9-/executors/msg.dfb155bc-2046-11e7-8019-02427fa1c4d5'
>  for gc 6.9192808889days in the future
> slave-one_1  | I0413 12:45:18.697649  8994 gc.cpp:55] Scheduling 
> '/tmp/mesos/agent/slaves/d1d616b4-1ed1-4fed-92e5-0ee3d8619be9-S0/frameworks/d1d616b4-1ed1-4fed-92e5-0ee3d8619be9-'
>  for gc 6.9192731556days in the future
> slave-one_1  | I0413 12:45:18.697670  8994 gc.cpp:55] Scheduling 
> '/tmp/mesos/agent/meta/slaves/d1d616b4-1ed1-4fed-92e5-0ee3d8619be9-S0/frameworks/d1d616b4-1ed1-4fed-92e5-0ee3d8619be9-'
>  for gc 6.9192698963days in the future
> slave-one_1  | I0413 12:45:18.697698  8994 status_update_manager.cpp:285] 
> Closing status update streams for framework 
> d1d616b4-1ed1-4fed-92e5-0ee3d8619be9-
> {noformat}
> Container 665e86c8-ef36-4be3-b56e-3ba7edc81182 is still running
> {noformat}
> root@orobas:/# 

[jira] [Commented] (MESOS-9114) cmake build is broken on macos

2018-07-26 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558963#comment-16558963
 ] 

Joseph Wu commented on MESOS-9114:
--

The patch looks correct, as I wouldn't expect the extracted tarball to have an 
extra folder (presumably named after the tarball).

The CMake build fails for me on both OSX and Ubuntu 16 (VM) without the patch.

> cmake build is broken on macos
> --
>
> Key: MESOS-9114
> URL: https://issues.apache.org/jira/browse/MESOS-9114
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.7.0
> Environment: macos-10.13.6
>Reporter: Benjamin Bannier
>Priority: Major
>
> Since the changes for MESOS-9092 have landed it seems impossible to perform a 
> cmake-based build on macos. This seems independent of the used generator and 
> both {{make}} and {{ninja}}-based builds appear broken, e.g.,
> {noformat}
> # cmake ~/src/mesos
> # make stout-tests
> ...
>  87%] Building CXX object 
> 3rdparty/stout/tests/CMakeFiles/stout-tests.dir/uuid_tests.cpp.o
> In file included from 
> /Users/bbannier/src/mesos/3rdparty/stout/tests/json_tests.cpp:24:
> In file included from 
> /Users/bbannier/src/mesos/3rdparty/stout/include/stout/json.hpp:41:
> /Users/bbannier/src/mesos/3rdparty/stout/include/stout/jsonify.hpp:36:10: 
> fatal error: 'rapidjson/stringbuffer.h' file not found
> #include <rapidjson/stringbuffer.h>
>  ^~
> ...
> {noformat}
> As a workaround I can apply the following patch,
> {code}
> diff --git a/3rdparty/CMakeLists.txt b/3rdparty/CMakeLists.txt
> index 9b0dfe0ab..b244267e8 100644
> --- a/3rdparty/CMakeLists.txt
> +++ b/3rdparty/CMakeLists.txt
> @@ -440,9 +440,7 @@ EXTERNAL(rapidjson ${RAPIDJSON_VERSION} 
> ${CMAKE_CURRENT_BINARY_DIR})
>  add_library(rapidjson INTERFACE)
>  add_dependencies(rapidjson ${RAPIDJSON_TARGET})
> -target_include_directories(
> -rapidjson INTERFACE
> -${RAPIDJSON_ROOT}/rapidjson-${RAPIDJSON_VERSION}/include)
> +target_include_directories(rapidjson INTERFACE ${RAPIDJSON_ROOT}/include)
>  ExternalProject_Add(
>${RAPIDJSON_TARGET}
> {code}
> This however seems to break cmake-based builds on Linux.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7947) Add GC capability to nested containers

2018-07-23 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553625#comment-16553625
 ] 

Joseph Wu commented on MESOS-7947:
--

In terms of GC-ing container sandboxes created via the LAUNCH_CONTAINER APIs, I 
think it will be relatively neat to pass the Agent's GarbageCollector to the 
Containerizer.  The Containerizer is the one with direct access to the sandbox 
directories (held within the checkpointed {{ContainerConfig}} protobufs) and 
can schedule GC whenever a container exits, or during recovery.  In future, if 
we provide a GCPolicy, that information would presumably be checkpointed into 
the {{ContainerConfig}} too; so it would be better to give the Containerizer 
access to the GarbageCollector.

This implementation should cover both nested containers and standalone 
containers.  And it would protect against the case where the user/executor 
forgets to call REMOVE_CONTAINER.

For now, the plan is to defer making framework changes.  Instead of adding a 
boolean or protobuf GCPolicy, I'll add an agent flag to tell the agent to GC 
non-executor sandboxes by default.  I don't have a nice name for this flag yet 
(currently {{--gc_non_executor_container_sandboxes}}).
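
As a rough illustration of that plan (not the actual Mesos code; the 
{{GarbageCollector}} below is a simplified stand-in for the agent's class, and 
the hook name and delay are invented for the sketch):
{code}
// Simplified sketch: a containerizer with access to a garbage collector
// schedules the exited container's sandbox for removal. The sandbox path
// would come from the checkpointed ContainerConfig.
#include <chrono>
#include <iostream>
#include <string>

struct GarbageCollector {
  // Schedule `path` for removal after `delay`.
  void schedule(std::chrono::seconds delay, const std::string& path) {
    std::cout << "Scheduling '" << path << "' for gc in "
              << delay.count() << "s" << std::endl;
  }
};

// Hypothetical hook: called when a (nested or standalone) container exits,
// or for known-exited containers during agent recovery.
void onContainerExit(GarbageCollector& gc, const std::string& sandboxDirectory)
{
  gc.schedule(std::chrono::hours(24 * 7), sandboxDirectory);
}
{code}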

---

Additionally, since the default executor (and custom executors) can be 
long-lived and run many tasks in its lifetime, we'll need to prune some of the 
Task metadata.  This is limited to directories like 
{{/meta/slaves/<slave_id>/frameworks/<framework_id>/executors/<executor_id>/runs/<container_id>/tasks/<task_id>}}.
  This metadata GC will happen for all tasks, and frameworks shouldn't need to 
change how this works.

> Add GC capability to nested containers
> --
>
> Key: MESOS-7947
> URL: https://issues.apache.org/jira/browse/MESOS-7947
> Project: Mesos
>  Issue Type: Improvement
>  Components: executor
>Reporter: Chun-Hung Hsiao
>Assignee: Joseph Wu
>Priority: Major
>
> We should extend the existing API or add a new API for nested containers for 
> an executor to tell the Mesos agent that a nested container is no longer 
> needed and can be scheduled for GC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-7947) Add GC capability to nested containers

2018-07-12 Thread Joseph Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-7947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-7947:


Assignee: Joseph Wu  (was: Qian Zhang)

I see three distinct and possibly complementary approaches to getting GC into 
nested containers:

# We can add a new Agent API, like {{GC_NESTED_CONTAINER}}, which can be called 
after the launch of a nested container.  This call would either mark the nested 
sandbox for GC like any other sandbox, or would manually schedule the GC for 
some time in the future.
# We can extend the {{LAUNCH_NESTED_CONTAINER}} call with a field that tells 
the agent to schedule GC of the nested sandbox upon the nested container's 
exit (see the sketch after this list).  We might give this field a default 
value to retain the current leaky behavior, or opt to GC by default.
# We can silently schedule nested sandboxes for GC.  We might even do so based 
on the agent's {{gc_disk_headroom}} and {{gc_delay}} and the task's {{disk}} 
resource (which means instant GC for tasks that request 0 disk).
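
To make approach 2 concrete, here is a purely hypothetical sketch; the type and 
field names are invented for illustration and are not an existing Mesos API.
{code}
// Hypothetical extension of the nested-container launch call (approach 2):
// an optional flag asks the agent to schedule the nested sandbox for GC once
// the container exits. Defaulting to false would keep the current (leaky)
// behavior.
#include <string>

struct LaunchNestedContainerCall {
  std::string containerId;
  std::string command;

  // New, optional field: when true, the agent schedules the nested sandbox
  // for garbage collection upon the nested container's exit.
  bool gcSandboxOnExit = false;
};
{code}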

[~qianzhang] I'm wondering if you have any thoughts about this (being the 
previous assignee).

> Add GC capability to nested containers
> --
>
> Key: MESOS-7947
> URL: https://issues.apache.org/jira/browse/MESOS-7947
> Project: Mesos
>  Issue Type: Improvement
>  Components: executor
>Reporter: Chun-Hung Hsiao
>Assignee: Joseph Wu
>Priority: Major
>
> We should extend the existing API or add a new API for nested containers for 
> an executor to tell the Mesos agent that a nested container is no longer 
> needed and can be scheduled for GC.
> Related issue: MESOS-7939



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

