[jira] [Updated] (MESOS-5889) Flakiness in SlaveRecoveryTest

2016-08-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-5889:
---
Sprint: Mesosphere Sprint 40

> Flakiness in SlaveRecoveryTest
> --
>
> Key: MESOS-5889
> URL: https://issues.apache.org/jira/browse/MESOS-5889
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Neil Conway
>Assignee: Benjamin Mahler
>  Labels: mesosphere
> Attachments: slave_recovery_cleanup_http_executor.log, 
> slave_recovery_recover_terminated_executor.log, 
> slave_recovery_recover_unregistered_http_executor.log
>
>
> Observed on internal CI. Seems like it is related to cgroups? Observed 
> similar failures in the following tests, and probably more related tests:
> SlaveRecoveryTest/0.CleanupHTTPExecutor
> SlaveRecoveryTest/0.RecoverUnregisteredHTTPExecutor
> SlaveRecoveryTest/0.RecoverTerminatedExecutor
> Log files attached.





[jira] [Assigned] (MESOS-5889) Flakiness in SlaveRecoveryTest

2016-08-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-5889:
--

Assignee: Benjamin Mahler

> Flakiness in SlaveRecoveryTest
> --
>
> Key: MESOS-5889
> URL: https://issues.apache.org/jira/browse/MESOS-5889
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Neil Conway
>Assignee: Benjamin Mahler
>  Labels: mesosphere
> Attachments: slave_recovery_cleanup_http_executor.log, 
> slave_recovery_recover_terminated_executor.log, 
> slave_recovery_recover_unregistered_http_executor.log
>
>
> Observed on internal CI. Seems like it is related to cgroups? Observed 
> similar failures in the following tests, and probably more related tests:
> SlaveRecoveryTest/0.CleanupHTTPExecutor
> SlaveRecoveryTest/0.RecoverUnregisteredHTTPExecutor
> SlaveRecoveryTest/0.RecoverTerminatedExecutor
> Log files attached.





[jira] [Commented] (MESOS-4577) libprocess can not run on 16-byte aligned stack mandatory architecture(aarch64)

2016-08-08 Thread AndyPang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412792#comment-15412792
 ] 

AndyPang commented on MESOS-4577:
-

Yeah, I use the "__aarch64__" macro to distinguish between the AArch64 and 
x86 architectures, but I don't know why the patch was discarded by Mesosphere.
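
As a rough illustration of that approach (a hedged sketch with assumed names, not the actual patch), the architecture macro can drive a compile-time choice of stack alignment:

{code}
// Hypothetical sketch: pick the required stack alignment at compile time
// based on the target architecture macro mentioned above.
#include <cstddef>

#ifdef __aarch64__
constexpr std::size_t STACK_ALIGNMENT = 16;  // AArch64 mandates a 16-byte aligned sp.
#else
constexpr std::size_t STACK_ALIGNMENT = 8;   // Sufficient for x86/x86-64.
#endif
{code}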

> libprocess can not run on 16-byte aligned stack mandatory 
> architecture(aarch64) 
> 
>
> Key: MESOS-4577
> URL: https://issues.apache.org/jira/browse/MESOS-4577
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, stout
> Environment: Linux 10-175-112-202 4.1.6-rc3.aarch64 #1 SMP Mon Oct 12 
> 01:43:03 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux
>Reporter: AndyPang
>Assignee: AndyPang
>  Labels: mesosphere
>
> Running Mesos on AArch64 produces an error; the log is:
> {code}
> E0101 00:06:56.636520 32411 slave.cpp:3342] Container 
> 'b6be429a-08f0-4d52-b01d-abfcb6e0106b' for executor 
> 'hello.84d205ae-f626-11de-bd66-7a3f6cf980b9' of framework 
> '868b9f04-9179-427b-b050-ee8f89ffa3bd-' failed to start: Failed to fork 
> executor: Failed to clone child process: Failed to clone: Invalid argument 
> {code}
> the "clone" achieve in libprocess 3rdparty stout library(in linux.hpp) 
> packaging a syscall "clone" :
> {code:title=clone|borderStyle=solid}
> inline pid_t clone(const lambda::function<int()>& func, int flags)
> {
>   // Stack for the child.
>   // - unsigned long long used for best alignment.
>   // - 8 MiB appears to be the default for "ulimit -s" on OSX and Linux.
>   //
>   // NOTE: We need to allocate the stack dynamically. This is because
>   // glibc's 'clone' will modify the stack passed to it, therefore the
>   // stack must NOT be shared as multiple 'clone's can be invoked
>   // simultaneously.
>   int stackSize = 8 * 1024 * 1024;
>
>   unsigned long long *stack =
>     new unsigned long long[stackSize/sizeof(unsigned long long)];
>
>   pid_t pid = ::clone(
>       childMain,
>       &stack[stackSize/sizeof(stack[0]) - 1],  // stack grows down.
>       flags,
>       (void*) &func);
>
>   // If CLONE_VM is not set, ::clone would create a process which runs in a
>   // separate copy of the memory space of the calling process. So we destroy
>   // the stack here to avoid memory leak. If CLONE_VM is set, ::clone would
>   // create a thread which runs in the same memory space with the calling
>   // process.
>   if (!(flags & CLONE_VM)) {
>     delete[] stack;
>   }
>
>   return pid;
> }
> {code}
> syscal "clone" parameter stack is 8-byte aligned,so if in 16-byte aligned 
> stack mandatory architecture(aarch64) it will get error.





[jira] [Commented] (MESOS-4577) libprocess can not run on 16-byte aligned stack mandatory architecture(aarch64)

2016-08-08 Thread AndyPang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412787#comment-15412787
 ] 

AndyPang commented on MESOS-4577:
-

It would be really cool if it has been fixed by kernel 4.7. I temporarily 
modified it to use 16-byte alignment; the patch is at:
https://reviews.apache.org/r/43182/diff/1#index_header
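
As a rough sketch of what 16-byte alignment of the child stack involves (assumed names, not the contents of the review above):

{code}
// Hypothetical illustration: round the top of the dynamically allocated
// child stack down to a 16-byte boundary before passing it to ::clone(),
// since AArch64 requires the stack pointer to be 16-byte aligned.
#include <cstddef>
#include <cstdint>

inline void* alignStackTop(unsigned long long* stack, std::size_t stackSize)
{
  // The stack grows down, so start from the end of the allocation and clear
  // the low four bits to obtain a 16-byte aligned address.
  std::uintptr_t top = reinterpret_cast<std::uintptr_t>(stack) + stackSize;
  return reinterpret_cast<void*>(top & ~static_cast<std::uintptr_t>(15));
}
{code}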

> libprocess can not run on 16-byte aligned stack mandatory 
> architecture(aarch64) 
> 
>
> Key: MESOS-4577
> URL: https://issues.apache.org/jira/browse/MESOS-4577
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, stout
> Environment: Linux 10-175-112-202 4.1.6-rc3.aarch64 #1 SMP Mon Oct 12 
> 01:43:03 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux
>Reporter: AndyPang
>Assignee: AndyPang
>  Labels: mesosphere
>
> Running Mesos on AArch64 produces an error; the log is:
> {code}
> E0101 00:06:56.636520 32411 slave.cpp:3342] Container 
> 'b6be429a-08f0-4d52-b01d-abfcb6e0106b' for executor 
> 'hello.84d205ae-f626-11de-bd66-7a3f6cf980b9' of framework 
> '868b9f04-9179-427b-b050-ee8f89ffa3bd-' failed to start: Failed to fork 
> executor: Failed to clone child process: Failed to clone: Invalid argument 
> {code}
> the "clone" achieve in libprocess 3rdparty stout library(in linux.hpp) 
> packaging a syscall "clone" :
> {code:title=clone|borderStyle=solid}
> inline pid_t clone(const lambda::function<int()>& func, int flags)
> {
>   // Stack for the child.
>   // - unsigned long long used for best alignment.
>   // - 8 MiB appears to be the default for "ulimit -s" on OSX and Linux.
>   //
>   // NOTE: We need to allocate the stack dynamically. This is because
>   // glibc's 'clone' will modify the stack passed to it, therefore the
>   // stack must NOT be shared as multiple 'clone's can be invoked
>   // simultaneously.
>   int stackSize = 8 * 1024 * 1024;
>
>   unsigned long long *stack =
>     new unsigned long long[stackSize/sizeof(unsigned long long)];
>
>   pid_t pid = ::clone(
>       childMain,
>       &stack[stackSize/sizeof(stack[0]) - 1],  // stack grows down.
>       flags,
>       (void*) &func);
>
>   // If CLONE_VM is not set, ::clone would create a process which runs in a
>   // separate copy of the memory space of the calling process. So we destroy
>   // the stack here to avoid memory leak. If CLONE_VM is set, ::clone would
>   // create a thread which runs in the same memory space with the calling
>   // process.
>   if (!(flags & CLONE_VM)) {
>     delete[] stack;
>   }
>
>   return pid;
> }
> {code}
> syscal "clone" parameter stack is 8-byte aligned,so if in 16-byte aligned 
> stack mandatory architecture(aarch64) it will get error.





[jira] [Updated] (MESOS-5830) Make a sweep to trim excess space around angle brackets

2016-08-08 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-5830:

Shepherd: Michael Park

> Make a sweep to trim excess space around angle brackets
> ---
>
> Key: MESOS-5830
> URL: https://issues.apache.org/jira/browse/MESOS-5830
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Bannier
>Assignee: Gaojin CAO
>Priority: Trivial
>  Labels: mesosphere, newbie
>
> The codebase still has pre-C++11 code where we needed to say e.g., 
> {{vector
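
Presumably the sweep targets the spacing that pre-C++11 compilers required between closing angle brackets in nested template arguments, along these lines (an illustrative example, not taken from the patch):

{code}
#include <vector>

// Pre-C++11: a space was needed so that '>>' was not lexed as the
// right-shift operator.
std::vector<std::vector<int> > before;

// C++11 and later parse '>>' in template argument lists correctly, so the
// extra space can be trimmed.
std::vector<std::vector<int>> after;
{code}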

[jira] [Commented] (MESOS-6009) Design doc for task groups

2016-08-08 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412736#comment-15412736
 ] 

Vinod Kone commented on MESOS-6009:
---

https://docs.google.com/document/d/1FtcyQkDfGp-bPHTW4pUoqQCgVlPde936bo-IIENO_ho/edit#heading=h.ip4t59nlogfz

> Design doc for task groups
> --
>
> Key: MESOS-6009
> URL: https://issues.apache.org/jira/browse/MESOS-6009
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: Jie Yu
>
> This ticket tracks the design for implementing task groups, which can be used 
> to deliver pod-like semantics.





[jira] [Created] (MESOS-6009) Design doc for task groups

2016-08-08 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-6009:
-

 Summary: Design doc for task groups
 Key: MESOS-6009
 URL: https://issues.apache.org/jira/browse/MESOS-6009
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone
Assignee: Jie Yu


This ticket tracks the design for implementing task groups, which can be used to 
deliver pod-like semantics.





[jira] [Commented] (MESOS-5830) Make a sweep to trim excess space around angle brackets

2016-08-08 Thread Gaojin CAO (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412734#comment-15412734
 ] 

Gaojin CAO commented on MESOS-5830:
---

ok, thanks.

> Make a sweep to trim excess space around angle brackets
> ---
>
> Key: MESOS-5830
> URL: https://issues.apache.org/jira/browse/MESOS-5830
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Bannier
>Assignee: Gaojin CAO
>Priority: Trivial
>  Labels: mesosphere, newbie
>
> The codebase still has pre-C++11 code where we needed to say e.g., 
> {{vector

[jira] [Commented] (MESOS-5889) Flakiness in SlaveRecoveryTest

2016-08-08 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412667#comment-15412667
 ] 

Benjamin Mahler commented on MESOS-5889:


[~neilc] we try to include both good and bad logs in flaky test reports, as it 
lowers the barrier to looking into the issue (no need to go digging around CI 
or to compile and run it locally to get logs for a good run). For example: MESOS-4800.

> Flakiness in SlaveRecoveryTest
> --
>
> Key: MESOS-5889
> URL: https://issues.apache.org/jira/browse/MESOS-5889
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Neil Conway
>  Labels: mesosphere
> Attachments: slave_recovery_cleanup_http_executor.log, 
> slave_recovery_recover_terminated_executor.log, 
> slave_recovery_recover_unregistered_http_executor.log
>
>
> Observed on internal CI. Seems like it is related to cgroups? Observed 
> similar failures in the following tests, and probably more related tests:
> SlaveRecoveryTest/0.CleanupHTTPExecutor
> SlaveRecoveryTest/0.RecoverUnregisteredHTTPExecutor
> SlaveRecoveryTest/0.RecoverTerminatedExecutor
> Log files attached.





[jira] [Created] (MESOS-6008) Add the infrastructure for a new python-based CLI.

2016-08-08 Thread Kevin Klues (JIRA)
Kevin Klues created MESOS-6008:
--

 Summary: Add the infrastructure for a new python-based CLI.
 Key: MESOS-6008
 URL: https://issues.apache.org/jira/browse/MESOS-6008
 Project: Mesos
  Issue Type: Improvement
Reporter: Kevin Klues
Assignee: Kevin Klues








[jira] [Commented] (MESOS-5830) Make a sweep to trim excess space around angle brackets

2016-08-08 Thread Gaojin CAO (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412640#comment-15412640
 ] 

Gaojin CAO commented on MESOS-5830:
---

https://reviews.apache.org/r/50887/
https://reviews.apache.org/r/50899/
https://reviews.apache.org/r/50900/

> Make a sweep to trim excess space around angle brackets
> ---
>
> Key: MESOS-5830
> URL: https://issues.apache.org/jira/browse/MESOS-5830
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Bannier
>Assignee: Gaojin CAO
>Priority: Trivial
>  Labels: mesosphere, newbie
>
> The codebase still has pre-C++11 code where we needed to say e.g., 
> {{vector

[jira] [Assigned] (MESOS-5830) Make a sweep to trim excess space around angle brackets

2016-08-08 Thread Gaojin CAO (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaojin CAO reassigned MESOS-5830:
-

Assignee: Gaojin CAO

> Make a sweep to trim excess space around angle brackets
> ---
>
> Key: MESOS-5830
> URL: https://issues.apache.org/jira/browse/MESOS-5830
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Bannier
>Assignee: Gaojin CAO
>Priority: Trivial
>  Labels: mesosphere, newbie
>
> The codebase still has pre-C++11 code where we needed to say e.g., 
> {{vector

[jira] [Updated] (MESOS-5988) PollSocketImpl can write to a stale fd.

2016-08-08 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-5988:
-
Shepherd: Benjamin Mahler

> PollSocketImpl can write to a stale fd.
> ---
>
> Key: MESOS-5988
> URL: https://issues.apache.org/jira/browse/MESOS-5988
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Benjamin Mahler
>Assignee: Greg Mann
>Priority: Blocker
>  Labels: mesosphere
> Fix For: 1.0.1
>
>
> While tracking down MESOS-5986 with [~greggomann] and [~anandmazumdar], we 
> were curious why PollSocketImpl avoids the same issue. It seems that 
> PollSocketImpl has a similar race; however, in the case of PollSocketImpl we 
> will simply write to a stale file descriptor.
> One example is {{PollSocketImpl::send(const char*, size_t)}}:
> https://github.com/apache/mesos/blob/1.0.0/3rdparty/libprocess/src/poll_socket.cpp#L241-L245
> {code}
> Future<size_t> PollSocketImpl::send(const char* data, size_t size)
> {
>   return io::poll(get(), io::WRITE)
>     .then(lambda::bind(&internal::socket_send_data, get(), data, size));
> }
>
> Future<size_t> socket_send_data(int s, const char* data, size_t size)
> {
>   CHECK(size > 0);
>
>   while (true) {
>     ssize_t length = send(s, data, size, MSG_NOSIGNAL);
>
> #ifdef __WINDOWS__
>     int error = WSAGetLastError();
> #else
>     int error = errno;
> #endif // __WINDOWS__
>
>     if (length < 0 && net::is_restartable_error(error)) {
>       // Interrupted, try again now.
>       continue;
>     } else if (length < 0 && net::is_retryable_error(error)) {
>       // Might block, try again later.
>       return io::poll(s, io::WRITE)
>         .then(lambda::bind(&internal::socket_send_data, s, data, size));
>     } else if (length <= 0) {
>       // Socket error or closed.
>       if (length < 0) {
>         const string error = os::strerror(errno);
>         VLOG(1) << "Socket error while sending: " << error;
>       } else {
>         VLOG(1) << "Socket closed while sending";
>       }
>       if (length == 0) {
>         return length;
>       } else {
>         return Failure(ErrnoError("Socket send failed"));
>       }
>     } else {
>       CHECK(length > 0);
>       return length;
>     }
>   }
> }
> {code}
> If the last reference to the {{Socket}} goes away before the 
> {{socket_send_data}} loop completes, then we will write to a stale fd!
> It turns out that we have avoided this issue because in libprocess we happen 
> to keep a reference to the {{Socket}} around when sending:
> https://github.com/apache/mesos/blob/1.0.0/3rdparty/libprocess/src/process.cpp#L1678-L1707
> {code}
> void send(Encoder* encoder, Socket socket)
> {
>   switch (encoder->kind()) {
>     case Encoder::DATA: {
>       size_t size;
>       const char* data = static_cast<DataEncoder*>(encoder)->next(&size);
>       socket.send(data, size)
>         .onAny(lambda::bind(
>             &internal::_send,
>             lambda::_1,
>             socket,
>             encoder,
>             size));
>       break;
>     }
>     case Encoder::FILE: {
>       off_t offset;
>       size_t size;
>       int fd = static_cast<FileEncoder*>(encoder)->next(&offset, &size);
>       socket.sendfile(fd, offset, size)
>         .onAny(lambda::bind(
>             &internal::_send,
>             lambda::_1,
>             socket,
>             encoder,
>             size));
>       break;
>     }
>   }
> }
> {code}
> However, this may not be true in all call-sites going forward. Currently, it 
> appears that http::Connection can trigger this bug.
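
A minimal, self-contained sketch of the lifetime concern (hypothetical types, not the libprocess fix): the continuation should hold a reference-counted handle to the socket so the fd cannot be closed and reused while the send is still in flight.

{code}
// Hypothetical illustration: capturing a shared_ptr to the socket state in
// the continuation keeps the fd alive until the asynchronous send completes.
#include <cstddef>
#include <functional>
#include <memory>

struct SocketState
{
  int fd;
  // The real destructor would ::close(fd) once the last reference is gone.
};

std::function<void()> makeSendContinuation(
    std::shared_ptr<SocketState> socket,
    const char* data,
    std::size_t size)
{
  // The lambda owns a copy of 'socket', so the fd stays valid for as long as
  // the continuation exists, avoiding a write to a stale file descriptor.
  return [socket, data, size]() {
    // ... ::send(socket->fd, data, size, 0) would go here ...
  };
}
{code}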





[jira] [Created] (MESOS-6007) Operator API v1 Improvements

2016-08-08 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-6007:
-

 Summary: Operator API v1 Improvements
 Key: MESOS-6007
 URL: https://issues.apache.org/jira/browse/MESOS-6007
 Project: Mesos
  Issue Type: Epic
Reporter: Vinod Kone


This is a follow-up epic to track the improvement work from MESOS-4791.





[jira] [Created] (MESOS-6006) Abstract mesos-style.py to allow future linters to be added more easily

2016-08-08 Thread Kevin Klues (JIRA)
Kevin Klues created MESOS-6006:
--

 Summary: Abstract mesos-style.py to allow future linters to be 
added more easily
 Key: MESOS-6006
 URL: https://issues.apache.org/jira/browse/MESOS-6006
 Project: Mesos
  Issue Type: Improvement
Reporter: Kevin Klues
Assignee: Kevin Klues
 Fix For: 1.1.0


Currently, mesos-style.py is just a collection of functions that
check the style of relevant files in the Mesos code base. However,
the script assumes that we always want to run cpplint over every
file we are checking. Since we are planning on adding a Python linter
to the codebase soon, it makes sense to abstract the common
functionality from this script into a class so that a cpp-based linter
and a python-based linter can inherit the same set of common
functionality.





[jira] [Updated] (MESOS-5988) PollSocketImpl can write to a stale fd.

2016-08-08 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-5988:
-
  Sprint: Mesosphere Sprint 40
Story Points: 3

> PollSocketImpl can write to a stale fd.
> ---
>
> Key: MESOS-5988
> URL: https://issues.apache.org/jira/browse/MESOS-5988
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Benjamin Mahler
>Assignee: Greg Mann
>Priority: Blocker
>  Labels: mesosphere
> Fix For: 1.0.1
>
>
> While tracking down MESOS-5986 with [~greggomann] and [~anandmazumdar], we 
> were curious why PollSocketImpl avoids the same issue. It seems that 
> PollSocketImpl has a similar race; however, in the case of PollSocketImpl we 
> will simply write to a stale file descriptor.
> One example is {{PollSocketImpl::send(const char*, size_t)}}:
> https://github.com/apache/mesos/blob/1.0.0/3rdparty/libprocess/src/poll_socket.cpp#L241-L245
> {code}
> Future<size_t> PollSocketImpl::send(const char* data, size_t size)
> {
>   return io::poll(get(), io::WRITE)
>     .then(lambda::bind(&internal::socket_send_data, get(), data, size));
> }
>
> Future<size_t> socket_send_data(int s, const char* data, size_t size)
> {
>   CHECK(size > 0);
>
>   while (true) {
>     ssize_t length = send(s, data, size, MSG_NOSIGNAL);
>
> #ifdef __WINDOWS__
>     int error = WSAGetLastError();
> #else
>     int error = errno;
> #endif // __WINDOWS__
>
>     if (length < 0 && net::is_restartable_error(error)) {
>       // Interrupted, try again now.
>       continue;
>     } else if (length < 0 && net::is_retryable_error(error)) {
>       // Might block, try again later.
>       return io::poll(s, io::WRITE)
>         .then(lambda::bind(&internal::socket_send_data, s, data, size));
>     } else if (length <= 0) {
>       // Socket error or closed.
>       if (length < 0) {
>         const string error = os::strerror(errno);
>         VLOG(1) << "Socket error while sending: " << error;
>       } else {
>         VLOG(1) << "Socket closed while sending";
>       }
>       if (length == 0) {
>         return length;
>       } else {
>         return Failure(ErrnoError("Socket send failed"));
>       }
>     } else {
>       CHECK(length > 0);
>       return length;
>     }
>   }
> }
> {code}
> If the last reference to the {{Socket}} goes away before the 
> {{socket_send_data}} loop completes, then we will write to a stale fd!
> It turns out that we have avoided this issue because in libprocess we happen 
> to keep a reference to the {{Socket}} around when sending:
> https://github.com/apache/mesos/blob/1.0.0/3rdparty/libprocess/src/process.cpp#L1678-L1707
> {code}
> void send(Encoder* encoder, Socket socket)
> {
>   switch (encoder->kind()) {
>     case Encoder::DATA: {
>       size_t size;
>       const char* data = static_cast<DataEncoder*>(encoder)->next(&size);
>       socket.send(data, size)
>         .onAny(lambda::bind(
>             &internal::_send,
>             lambda::_1,
>             socket,
>             encoder,
>             size));
>       break;
>     }
>     case Encoder::FILE: {
>       off_t offset;
>       size_t size;
>       int fd = static_cast<FileEncoder*>(encoder)->next(&offset, &size);
>       socket.sendfile(fd, offset, size)
>         .onAny(lambda::bind(
>             &internal::_send,
>             lambda::_1,
>             socket,
>             encoder,
>             size));
>       break;
>     }
>   }
> }
> {code}
> However, this may not be true in all call-sites going forward. Currently, it 
> appears that http::Connection can trigger this bug.





[jira] [Assigned] (MESOS-5988) PollSocketImpl can write to a stale fd.

2016-08-08 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-5988:


Assignee: Greg Mann

> PollSocketImpl can write to a stale fd.
> ---
>
> Key: MESOS-5988
> URL: https://issues.apache.org/jira/browse/MESOS-5988
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Benjamin Mahler
>Assignee: Greg Mann
>Priority: Blocker
>  Labels: mesosphere
> Fix For: 1.0.1
>
>
> While tracking down MESOS-5986 with [~greggomann] and [~anandmazumdar], we 
> were curious why PollSocketImpl avoids the same issue. It seems that 
> PollSocketImpl has a similar race; however, in the case of PollSocketImpl we 
> will simply write to a stale file descriptor.
> One example is {{PollSocketImpl::send(const char*, size_t)}}:
> https://github.com/apache/mesos/blob/1.0.0/3rdparty/libprocess/src/poll_socket.cpp#L241-L245
> {code}
> Future<size_t> PollSocketImpl::send(const char* data, size_t size)
> {
>   return io::poll(get(), io::WRITE)
>     .then(lambda::bind(&internal::socket_send_data, get(), data, size));
> }
>
> Future<size_t> socket_send_data(int s, const char* data, size_t size)
> {
>   CHECK(size > 0);
>
>   while (true) {
>     ssize_t length = send(s, data, size, MSG_NOSIGNAL);
>
> #ifdef __WINDOWS__
>     int error = WSAGetLastError();
> #else
>     int error = errno;
> #endif // __WINDOWS__
>
>     if (length < 0 && net::is_restartable_error(error)) {
>       // Interrupted, try again now.
>       continue;
>     } else if (length < 0 && net::is_retryable_error(error)) {
>       // Might block, try again later.
>       return io::poll(s, io::WRITE)
>         .then(lambda::bind(&internal::socket_send_data, s, data, size));
>     } else if (length <= 0) {
>       // Socket error or closed.
>       if (length < 0) {
>         const string error = os::strerror(errno);
>         VLOG(1) << "Socket error while sending: " << error;
>       } else {
>         VLOG(1) << "Socket closed while sending";
>       }
>       if (length == 0) {
>         return length;
>       } else {
>         return Failure(ErrnoError("Socket send failed"));
>       }
>     } else {
>       CHECK(length > 0);
>       return length;
>     }
>   }
> }
> {code}
> If the last reference to the {{Socket}} goes away before the 
> {{socket_send_data}} loop completes, then we will write to a stale fd!
> It turns out that we have avoided this issue because in libprocess we happen 
> to keep a reference to the {{Socket}} around when sending:
> https://github.com/apache/mesos/blob/1.0.0/3rdparty/libprocess/src/process.cpp#L1678-L1707
> {code}
> void send(Encoder* encoder, Socket socket)
> {
>   switch (encoder->kind()) {
>     case Encoder::DATA: {
>       size_t size;
>       const char* data = static_cast<DataEncoder*>(encoder)->next(&size);
>       socket.send(data, size)
>         .onAny(lambda::bind(
>             &internal::_send,
>             lambda::_1,
>             socket,
>             encoder,
>             size));
>       break;
>     }
>     case Encoder::FILE: {
>       off_t offset;
>       size_t size;
>       int fd = static_cast<FileEncoder*>(encoder)->next(&offset, &size);
>       socket.sendfile(fd, offset, size)
>         .onAny(lambda::bind(
>             &internal::_send,
>             lambda::_1,
>             socket,
>             encoder,
>             size));
>       break;
>     }
>   }
> }
> {code}
> However, this may not be true in all call-sites going forward. Currently, it 
> appears that http::Connection can trigger this bug.





[jira] [Commented] (MESOS-5830) Make a sweep to trim excess space around angle brackets

2016-08-08 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412268#comment-15412268
 ] 

Gilbert Song commented on MESOS-5830:
-

[~zerobleed], welcome to the community! Please quickly go through the doc 
[~haosd...@gmail.com] pasted above. You may need a shepherd for this JIRA. 
Feel free to join the community Slack channel (mesos.slack.com); you can get 
quick answers by asking questions there. :)

> Make a sweep to trim excess space around angle brackets
> ---
>
> Key: MESOS-5830
> URL: https://issues.apache.org/jira/browse/MESOS-5830
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Bannier
>Priority: Trivial
>  Labels: mesosphere, newbie
>
> The codebase still has pre-C++11 code where we needed to say e.g., 
> {{vector

[jira] [Commented] (MESOS-5991) Support running docker daemon inside a container using unified containerizer.

2016-08-08 Thread Stéphane Cottin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412260#comment-15412260
 ] 

Stéphane Cottin commented on MESOS-5991:


Standalone.
I plan to migrate to the Mesos plugin when it is compatible with Mesos >= 
1.0.0 and the unified containerizer.

> Support running docker daemon inside a container using unified containerizer.
> -
>
> Key: MESOS-5991
> URL: https://issues.apache.org/jira/browse/MESOS-5991
> Project: Mesos
>  Issue Type: Epic
>Reporter: Jie Yu
>
> The goal is to develop the necessary pieces in the unified containerizer so 
> that a framework can launch a full-fledged Docker daemon in a container.
> This will be useful for frameworks like Jenkins. The Jenkins job can still 
> use the Docker CLI to do builds (e.g., `docker build`, `docker push`), but we 
> don't have to install the Docker daemon on the host anymore.
> It looks like LXD already supports this and is pretty stable for some users. 
> We should do some investigation to see which features are missing in the 
> unified containerizer to be able to match what LXD has. We will track all the 
> dependencies in this ticket.
> https://www.stgraber.org/2016/04/13/lxd-2-0-docker-in-lxd-712/
> Cgroups and user namespaces support are definitely missing pieces.





[jira] [Commented] (MESOS-5991) Support running docker daemon inside a container using unified containerizer.

2016-08-08 Thread Sunil Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412245#comment-15412245
 ] 

Sunil Shah commented on MESOS-5991:
---

[~kaalh]: are you using this Docker image with the Jenkins Mesos plugin or 
running it standalone?

> Support running docker daemon inside a container using unified containerizer.
> -
>
> Key: MESOS-5991
> URL: https://issues.apache.org/jira/browse/MESOS-5991
> Project: Mesos
>  Issue Type: Epic
>Reporter: Jie Yu
>
> The goal is to develop the necessary pieces in the unified containerizer so 
> that a framework can launch a full-fledged Docker daemon in a container.
> This will be useful for frameworks like Jenkins. The Jenkins job can still 
> use the Docker CLI to do builds (e.g., `docker build`, `docker push`), but we 
> don't have to install the Docker daemon on the host anymore.
> It looks like LXD already supports this and is pretty stable for some users. 
> We should do some investigation to see which features are missing in the 
> unified containerizer to be able to match what LXD has. We will track all the 
> dependencies in this ticket.
> https://www.stgraber.org/2016/04/13/lxd-2-0-docker-in-lxd-712/
> Cgroups and user namespaces support are definitely missing pieces.





[jira] [Updated] (MESOS-5986) SSL Socket CHECK can fail after socket receives EOF

2016-08-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-5986:
---
Affects Version/s: 1.0.0

> SSL Socket CHECK can fail after socket receives EOF
> ---
>
> Key: MESOS-5986
> URL: https://issues.apache.org/jira/browse/MESOS-5986
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.0.0
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Blocker
>  Labels: mesosphere
> Fix For: 1.0.1
>
>
> While writing a test for MESOS-3753, I encountered a bug where [this 
> check|https://github.com/apache/mesos/blob/853821cafcca3550b9c7bdaba5262d73869e2ee1/3rdparty/libprocess/src/libevent_ssl_socket.cpp#L708]
>  fails at the very end of the test body, while objects in the stack frame are 
> being destroyed. After adding some debug logging output, I produced the 
> following:
> {code}
> I0804 08:32:33.263211 273793024 libevent_ssl_socket.cpp:681] *** in send()17
> I0804 08:32:33.263209 273256448 process.cpp:2970] Cleaning up 
> __limiter__(3)@127.0.0.1:55688
> I0804 08:32:33.263263 275939328 libevent_ssl_socket.cpp:152] *** in 
> initialize(): 14
> I0804 08:32:33.263206 272719872 process.cpp:2865] Resuming 
> (61)@127.0.0.1:55688 at 2016-08-04 15:32:33.263261952+00:00
> I0804 08:32:33.263327 275939328 libevent_ssl_socket.cpp:584] *** in recv()14
> I0804 08:32:33.263337 272719872 hierarchical.cpp:571] Agent 
> e2a49340-34ec-403f-a5a4-15e29c4a2434-S0 deactivated
> I0804 08:32:33.263322 275402752 process.cpp:2865] Resuming 
> help@127.0.0.1:55688 at 2016-08-04 15:32:33.263343104+00:00
> I0804 08:32:33.263510 275939328 libevent_ssl_socket.cpp:322] *** in 
> event_callback(bev)
> I0804 08:32:33.263536 275939328 libevent_ssl_socket.cpp:353] *** in 
> event_callback check for EOF/CONNECTED/ERROR: 19
> I0804 08:32:33.263592 275939328 libevent_ssl_socket.cpp:159] *** in 
> shutdown(): 19
> I0804 08:32:33.263622 1985901312 process.cpp:3170] Donating thread to 
> (87)@127.0.0.1:55688 while waiting
> I0804 08:32:33.263639 274329600 process.cpp:2865] Resuming 
> __http__(12)@127.0.0.1:55688 at 2016-08-04 15:32:33.263653888+00:00
> I0804 08:32:33.263659 1985901312 process.cpp:2865] Resuming 
> (87)@127.0.0.1:55688 at 2016-08-04 15:32:33.263671040+00:00
> I0804 08:32:33.263730 1985901312 process.cpp:2970] Cleaning up 
> (87)@127.0.0.1:55688
> I0804 08:32:33.263741 275939328 libevent_ssl_socket.cpp:322] *** in 
> event_callback(bev)
> I0804 08:32:33.263736 274329600 process.cpp:2970] Cleaning up 
> __http__(12)@127.0.0.1:55688
> I0804 08:32:33.263778 275939328 libevent_ssl_socket.cpp:353] *** in 
> event_callback check for EOF/CONNECTED/ERROR: 17
> I0804 08:32:33.263818 275939328 libevent_ssl_socket.cpp:159] *** in 
> shutdown(): 17
> I0804 08:32:33.263839 272183296 process.cpp:2865] Resuming 
> help@127.0.0.1:55688 at 2016-08-04 15:32:33.263857920+00:00
> I0804 08:32:33.263933 273793024 process.cpp:2865] Resuming 
> __gc__@127.0.0.1:55688 at 2016-08-04 15:32:33.263951104+00:00
> I0804 08:32:33.264034 275939328 libevent_ssl_socket.cpp:681] *** in send()17
> I0804 08:32:33.264020 272719872 process.cpp:2865] Resuming 
> __http__(11)@127.0.0.1:55688 at 2016-08-04 15:32:33.264041984+00:00
> I0804 08:32:33.264036 274329600 process.cpp:2865] Resuming 
> status-update-manager(3)@127.0.0.1:55688 at 2016-08-04 
> 15:32:33.264056064+00:00
> I0804 08:32:33.264071 272719872 process.cpp:2970] Cleaning up 
> __http__(11)@127.0.0.1:55688
> I0804 08:32:33.264088 274329600 process.cpp:2970] Cleaning up 
> status-update-manager(3)@127.0.0.1:55688
> I0804 08:32:33.264086 275939328 libevent_ssl_socket.cpp:721] *** sending on 
> socket: 17, data: 0
> I0804 08:32:33.264112 272183296 process.cpp:2865] Resuming 
> (89)@127.0.0.1:55688 at 2016-08-04 15:32:33.264126976+00:00
> I0804 08:32:33.264118 275402752 process.cpp:2865] Resuming 
> help@127.0.0.1:55688 at 2016-08-04 15:32:33.264144896+00:00
> I0804 08:32:33.264149 272183296 process.cpp:2970] Cleaning up 
> (89)@127.0.0.1:55688
> I0804 08:32:33.264202 275939328 libevent_ssl_socket.cpp:281] *** in 
> send_callback(bev)
> I0804 08:32:33.264400 273793024 process.cpp:3170] Donating thread to 
> (86)@127.0.0.1:55688 while waiting
> I0804 08:32:33.264413 273256448 process.cpp:2865] Resuming 
> (76)@127.0.0.1:55688 at 2016-08-04 15:32:33.264428032+00:00
> I0804 08:32:33.296268 275939328 libevent_ssl_socket.cpp:300] *** in 
> send_callback(): 17
> I0804 08:32:33.296419 273256448 process.cpp:2970] Cleaning up 
> (76)@127.0.0.1:55688
> I0804 08:32:33.296357 273793024 process.cpp:2865] Resuming 
> (86)@127.0.0.1:55688 at 2016-08-04 15:32:33.296414976+00:00
> I0804 08:32:33.296464 273793024 process.cpp:2970] Cleaning up 
> (86)@127.0.0.1:55688
> I0804 08:32:33.296497 275939328 

[jira] [Updated] (MESOS-5986) SSL Socket CHECK can fail after socket receives EOF

2016-08-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-5986:
---
Fix Version/s: 1.1.0

> SSL Socket CHECK can fail after socket receives EOF
> ---
>
> Key: MESOS-5986
> URL: https://issues.apache.org/jira/browse/MESOS-5986
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.0.0
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Blocker
>  Labels: mesosphere
> Fix For: 1.0.1
>
>
> While writing a test for MESOS-3753, I encountered a bug where [this 
> check|https://github.com/apache/mesos/blob/853821cafcca3550b9c7bdaba5262d73869e2ee1/3rdparty/libprocess/src/libevent_ssl_socket.cpp#L708]
>  fails at the very end of the test body, while objects in the stack frame are 
> being destroyed. After adding some debug logging output, I produced the 
> following:
> {code}
> I0804 08:32:33.263211 273793024 libevent_ssl_socket.cpp:681] *** in send()17
> I0804 08:32:33.263209 273256448 process.cpp:2970] Cleaning up 
> __limiter__(3)@127.0.0.1:55688
> I0804 08:32:33.263263 275939328 libevent_ssl_socket.cpp:152] *** in 
> initialize(): 14
> I0804 08:32:33.263206 272719872 process.cpp:2865] Resuming 
> (61)@127.0.0.1:55688 at 2016-08-04 15:32:33.263261952+00:00
> I0804 08:32:33.263327 275939328 libevent_ssl_socket.cpp:584] *** in recv()14
> I0804 08:32:33.263337 272719872 hierarchical.cpp:571] Agent 
> e2a49340-34ec-403f-a5a4-15e29c4a2434-S0 deactivated
> I0804 08:32:33.263322 275402752 process.cpp:2865] Resuming 
> help@127.0.0.1:55688 at 2016-08-04 15:32:33.263343104+00:00
> I0804 08:32:33.263510 275939328 libevent_ssl_socket.cpp:322] *** in 
> event_callback(bev)
> I0804 08:32:33.263536 275939328 libevent_ssl_socket.cpp:353] *** in 
> event_callback check for EOF/CONNECTED/ERROR: 19
> I0804 08:32:33.263592 275939328 libevent_ssl_socket.cpp:159] *** in 
> shutdown(): 19
> I0804 08:32:33.263622 1985901312 process.cpp:3170] Donating thread to 
> (87)@127.0.0.1:55688 while waiting
> I0804 08:32:33.263639 274329600 process.cpp:2865] Resuming 
> __http__(12)@127.0.0.1:55688 at 2016-08-04 15:32:33.263653888+00:00
> I0804 08:32:33.263659 1985901312 process.cpp:2865] Resuming 
> (87)@127.0.0.1:55688 at 2016-08-04 15:32:33.263671040+00:00
> I0804 08:32:33.263730 1985901312 process.cpp:2970] Cleaning up 
> (87)@127.0.0.1:55688
> I0804 08:32:33.263741 275939328 libevent_ssl_socket.cpp:322] *** in 
> event_callback(bev)
> I0804 08:32:33.263736 274329600 process.cpp:2970] Cleaning up 
> __http__(12)@127.0.0.1:55688
> I0804 08:32:33.263778 275939328 libevent_ssl_socket.cpp:353] *** in 
> event_callback check for EOF/CONNECTED/ERROR: 17
> I0804 08:32:33.263818 275939328 libevent_ssl_socket.cpp:159] *** in 
> shutdown(): 17
> I0804 08:32:33.263839 272183296 process.cpp:2865] Resuming 
> help@127.0.0.1:55688 at 2016-08-04 15:32:33.263857920+00:00
> I0804 08:32:33.263933 273793024 process.cpp:2865] Resuming 
> __gc__@127.0.0.1:55688 at 2016-08-04 15:32:33.263951104+00:00
> I0804 08:32:33.264034 275939328 libevent_ssl_socket.cpp:681] *** in send()17
> I0804 08:32:33.264020 272719872 process.cpp:2865] Resuming 
> __http__(11)@127.0.0.1:55688 at 2016-08-04 15:32:33.264041984+00:00
> I0804 08:32:33.264036 274329600 process.cpp:2865] Resuming 
> status-update-manager(3)@127.0.0.1:55688 at 2016-08-04 
> 15:32:33.264056064+00:00
> I0804 08:32:33.264071 272719872 process.cpp:2970] Cleaning up 
> __http__(11)@127.0.0.1:55688
> I0804 08:32:33.264088 274329600 process.cpp:2970] Cleaning up 
> status-update-manager(3)@127.0.0.1:55688
> I0804 08:32:33.264086 275939328 libevent_ssl_socket.cpp:721] *** sending on 
> socket: 17, data: 0
> I0804 08:32:33.264112 272183296 process.cpp:2865] Resuming 
> (89)@127.0.0.1:55688 at 2016-08-04 15:32:33.264126976+00:00
> I0804 08:32:33.264118 275402752 process.cpp:2865] Resuming 
> help@127.0.0.1:55688 at 2016-08-04 15:32:33.264144896+00:00
> I0804 08:32:33.264149 272183296 process.cpp:2970] Cleaning up 
> (89)@127.0.0.1:55688
> I0804 08:32:33.264202 275939328 libevent_ssl_socket.cpp:281] *** in 
> send_callback(bev)
> I0804 08:32:33.264400 273793024 process.cpp:3170] Donating thread to 
> (86)@127.0.0.1:55688 while waiting
> I0804 08:32:33.264413 273256448 process.cpp:2865] Resuming 
> (76)@127.0.0.1:55688 at 2016-08-04 15:32:33.264428032+00:00
> I0804 08:32:33.296268 275939328 libevent_ssl_socket.cpp:300] *** in 
> send_callback(): 17
> I0804 08:32:33.296419 273256448 process.cpp:2970] Cleaning up 
> (76)@127.0.0.1:55688
> I0804 08:32:33.296357 273793024 process.cpp:2865] Resuming 
> (86)@127.0.0.1:55688 at 2016-08-04 15:32:33.296414976+00:00
> I0804 08:32:33.296464 273793024 process.cpp:2970] Cleaning up 
> (86)@127.0.0.1:55688
> I0804 08:32:33.296497 275939328 

[jira] [Updated] (MESOS-5986) SSL Socket CHECK can fail after socket receives EOF

2016-08-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-5986:
---
Fix Version/s: (was: 1.1.0)

> SSL Socket CHECK can fail after socket receives EOF
> ---
>
> Key: MESOS-5986
> URL: https://issues.apache.org/jira/browse/MESOS-5986
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.0.0
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Blocker
>  Labels: mesosphere
> Fix For: 1.0.1
>
>
> While writing a test for MESOS-3753, I encountered a bug where [this 
> check|https://github.com/apache/mesos/blob/853821cafcca3550b9c7bdaba5262d73869e2ee1/3rdparty/libprocess/src/libevent_ssl_socket.cpp#L708]
>  fails at the very end of the test body, while objects in the stack frame are 
> being destroyed. After adding some debug logging output, I produced the 
> following:
> {code}
> I0804 08:32:33.263211 273793024 libevent_ssl_socket.cpp:681] *** in send()17
> I0804 08:32:33.263209 273256448 process.cpp:2970] Cleaning up 
> __limiter__(3)@127.0.0.1:55688
> I0804 08:32:33.263263 275939328 libevent_ssl_socket.cpp:152] *** in 
> initialize(): 14
> I0804 08:32:33.263206 272719872 process.cpp:2865] Resuming 
> (61)@127.0.0.1:55688 at 2016-08-04 15:32:33.263261952+00:00
> I0804 08:32:33.263327 275939328 libevent_ssl_socket.cpp:584] *** in recv()14
> I0804 08:32:33.263337 272719872 hierarchical.cpp:571] Agent 
> e2a49340-34ec-403f-a5a4-15e29c4a2434-S0 deactivated
> I0804 08:32:33.263322 275402752 process.cpp:2865] Resuming 
> help@127.0.0.1:55688 at 2016-08-04 15:32:33.263343104+00:00
> I0804 08:32:33.263510 275939328 libevent_ssl_socket.cpp:322] *** in 
> event_callback(bev)
> I0804 08:32:33.263536 275939328 libevent_ssl_socket.cpp:353] *** in 
> event_callback check for EOF/CONNECTED/ERROR: 19
> I0804 08:32:33.263592 275939328 libevent_ssl_socket.cpp:159] *** in 
> shutdown(): 19
> I0804 08:32:33.263622 1985901312 process.cpp:3170] Donating thread to 
> (87)@127.0.0.1:55688 while waiting
> I0804 08:32:33.263639 274329600 process.cpp:2865] Resuming 
> __http__(12)@127.0.0.1:55688 at 2016-08-04 15:32:33.263653888+00:00
> I0804 08:32:33.263659 1985901312 process.cpp:2865] Resuming 
> (87)@127.0.0.1:55688 at 2016-08-04 15:32:33.263671040+00:00
> I0804 08:32:33.263730 1985901312 process.cpp:2970] Cleaning up 
> (87)@127.0.0.1:55688
> I0804 08:32:33.263741 275939328 libevent_ssl_socket.cpp:322] *** in 
> event_callback(bev)
> I0804 08:32:33.263736 274329600 process.cpp:2970] Cleaning up 
> __http__(12)@127.0.0.1:55688
> I0804 08:32:33.263778 275939328 libevent_ssl_socket.cpp:353] *** in 
> event_callback check for EOF/CONNECTED/ERROR: 17
> I0804 08:32:33.263818 275939328 libevent_ssl_socket.cpp:159] *** in 
> shutdown(): 17
> I0804 08:32:33.263839 272183296 process.cpp:2865] Resuming 
> help@127.0.0.1:55688 at 2016-08-04 15:32:33.263857920+00:00
> I0804 08:32:33.263933 273793024 process.cpp:2865] Resuming 
> __gc__@127.0.0.1:55688 at 2016-08-04 15:32:33.263951104+00:00
> I0804 08:32:33.264034 275939328 libevent_ssl_socket.cpp:681] *** in send()17
> I0804 08:32:33.264020 272719872 process.cpp:2865] Resuming 
> __http__(11)@127.0.0.1:55688 at 2016-08-04 15:32:33.264041984+00:00
> I0804 08:32:33.264036 274329600 process.cpp:2865] Resuming 
> status-update-manager(3)@127.0.0.1:55688 at 2016-08-04 
> 15:32:33.264056064+00:00
> I0804 08:32:33.264071 272719872 process.cpp:2970] Cleaning up 
> __http__(11)@127.0.0.1:55688
> I0804 08:32:33.264088 274329600 process.cpp:2970] Cleaning up 
> status-update-manager(3)@127.0.0.1:55688
> I0804 08:32:33.264086 275939328 libevent_ssl_socket.cpp:721] *** sending on 
> socket: 17, data: 0
> I0804 08:32:33.264112 272183296 process.cpp:2865] Resuming 
> (89)@127.0.0.1:55688 at 2016-08-04 15:32:33.264126976+00:00
> I0804 08:32:33.264118 275402752 process.cpp:2865] Resuming 
> help@127.0.0.1:55688 at 2016-08-04 15:32:33.264144896+00:00
> I0804 08:32:33.264149 272183296 process.cpp:2970] Cleaning up 
> (89)@127.0.0.1:55688
> I0804 08:32:33.264202 275939328 libevent_ssl_socket.cpp:281] *** in 
> send_callback(bev)
> I0804 08:32:33.264400 273793024 process.cpp:3170] Donating thread to 
> (86)@127.0.0.1:55688 while waiting
> I0804 08:32:33.264413 273256448 process.cpp:2865] Resuming 
> (76)@127.0.0.1:55688 at 2016-08-04 15:32:33.264428032+00:00
> I0804 08:32:33.296268 275939328 libevent_ssl_socket.cpp:300] *** in 
> send_callback(): 17
> I0804 08:32:33.296419 273256448 process.cpp:2970] Cleaning up 
> (76)@127.0.0.1:55688
> I0804 08:32:33.296357 273793024 process.cpp:2865] Resuming 
> (86)@127.0.0.1:55688 at 2016-08-04 15:32:33.296414976+00:00
> I0804 08:32:33.296464 273793024 process.cpp:2970] Cleaning up 
> (86)@127.0.0.1:55688
> I0804 08:32:33.296497 275939328 

[jira] [Updated] (MESOS-5986) SSL Socket CHECK can fail after socket receives EOF

2016-08-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-5986:
---
Fix Version/s: 1.0.1

> SSL Socket CHECK can fail after socket receives EOF
> ---
>
> Key: MESOS-5986
> URL: https://issues.apache.org/jira/browse/MESOS-5986
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.0.0
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Blocker
>  Labels: mesosphere
> Fix For: 1.0.1, 1.1.0
>
>
> While writing a test for MESOS-3753, I encountered a bug where [this 
> check|https://github.com/apache/mesos/blob/853821cafcca3550b9c7bdaba5262d73869e2ee1/3rdparty/libprocess/src/libevent_ssl_socket.cpp#L708]
>  fails at the very end of the test body, while objects in the stack frame are 
> being destroyed. After adding some debug logging output, I produced the 
> following:
> {code}
> I0804 08:32:33.263211 273793024 libevent_ssl_socket.cpp:681] *** in send()17
> I0804 08:32:33.263209 273256448 process.cpp:2970] Cleaning up 
> __limiter__(3)@127.0.0.1:55688
> I0804 08:32:33.263263 275939328 libevent_ssl_socket.cpp:152] *** in 
> initialize(): 14
> I0804 08:32:33.263206 272719872 process.cpp:2865] Resuming 
> (61)@127.0.0.1:55688 at 2016-08-04 15:32:33.263261952+00:00
> I0804 08:32:33.263327 275939328 libevent_ssl_socket.cpp:584] *** in recv()14
> I0804 08:32:33.263337 272719872 hierarchical.cpp:571] Agent 
> e2a49340-34ec-403f-a5a4-15e29c4a2434-S0 deactivated
> I0804 08:32:33.263322 275402752 process.cpp:2865] Resuming 
> help@127.0.0.1:55688 at 2016-08-04 15:32:33.263343104+00:00
> I0804 08:32:33.263510 275939328 libevent_ssl_socket.cpp:322] *** in 
> event_callback(bev)
> I0804 08:32:33.263536 275939328 libevent_ssl_socket.cpp:353] *** in 
> event_callback check for EOF/CONNECTED/ERROR: 19
> I0804 08:32:33.263592 275939328 libevent_ssl_socket.cpp:159] *** in 
> shutdown(): 19
> I0804 08:32:33.263622 1985901312 process.cpp:3170] Donating thread to 
> (87)@127.0.0.1:55688 while waiting
> I0804 08:32:33.263639 274329600 process.cpp:2865] Resuming 
> __http__(12)@127.0.0.1:55688 at 2016-08-04 15:32:33.263653888+00:00
> I0804 08:32:33.263659 1985901312 process.cpp:2865] Resuming 
> (87)@127.0.0.1:55688 at 2016-08-04 15:32:33.263671040+00:00
> I0804 08:32:33.263730 1985901312 process.cpp:2970] Cleaning up 
> (87)@127.0.0.1:55688
> I0804 08:32:33.263741 275939328 libevent_ssl_socket.cpp:322] *** in 
> event_callback(bev)
> I0804 08:32:33.263736 274329600 process.cpp:2970] Cleaning up 
> __http__(12)@127.0.0.1:55688
> I0804 08:32:33.263778 275939328 libevent_ssl_socket.cpp:353] *** in 
> event_callback check for EOF/CONNECTED/ERROR: 17
> I0804 08:32:33.263818 275939328 libevent_ssl_socket.cpp:159] *** in 
> shutdown(): 17
> I0804 08:32:33.263839 272183296 process.cpp:2865] Resuming 
> help@127.0.0.1:55688 at 2016-08-04 15:32:33.263857920+00:00
> I0804 08:32:33.263933 273793024 process.cpp:2865] Resuming 
> __gc__@127.0.0.1:55688 at 2016-08-04 15:32:33.263951104+00:00
> I0804 08:32:33.264034 275939328 libevent_ssl_socket.cpp:681] *** in send()17
> I0804 08:32:33.264020 272719872 process.cpp:2865] Resuming 
> __http__(11)@127.0.0.1:55688 at 2016-08-04 15:32:33.264041984+00:00
> I0804 08:32:33.264036 274329600 process.cpp:2865] Resuming 
> status-update-manager(3)@127.0.0.1:55688 at 2016-08-04 
> 15:32:33.264056064+00:00
> I0804 08:32:33.264071 272719872 process.cpp:2970] Cleaning up 
> __http__(11)@127.0.0.1:55688
> I0804 08:32:33.264088 274329600 process.cpp:2970] Cleaning up 
> status-update-manager(3)@127.0.0.1:55688
> I0804 08:32:33.264086 275939328 libevent_ssl_socket.cpp:721] *** sending on 
> socket: 17, data: 0
> I0804 08:32:33.264112 272183296 process.cpp:2865] Resuming 
> (89)@127.0.0.1:55688 at 2016-08-04 15:32:33.264126976+00:00
> I0804 08:32:33.264118 275402752 process.cpp:2865] Resuming 
> help@127.0.0.1:55688 at 2016-08-04 15:32:33.264144896+00:00
> I0804 08:32:33.264149 272183296 process.cpp:2970] Cleaning up 
> (89)@127.0.0.1:55688
> I0804 08:32:33.264202 275939328 libevent_ssl_socket.cpp:281] *** in 
> send_callback(bev)
> I0804 08:32:33.264400 273793024 process.cpp:3170] Donating thread to 
> (86)@127.0.0.1:55688 while waiting
> I0804 08:32:33.264413 273256448 process.cpp:2865] Resuming 
> (76)@127.0.0.1:55688 at 2016-08-04 15:32:33.264428032+00:00
> I0804 08:32:33.296268 275939328 libevent_ssl_socket.cpp:300] *** in 
> send_callback(): 17
> I0804 08:32:33.296419 273256448 process.cpp:2970] Cleaning up 
> (76)@127.0.0.1:55688
> I0804 08:32:33.296357 273793024 process.cpp:2865] Resuming 
> (86)@127.0.0.1:55688 at 2016-08-04 15:32:33.296414976+00:00
> I0804 08:32:33.296464 273793024 process.cpp:2970] Cleaning up 
> (86)@127.0.0.1:55688
> I0804 08:32:33.296497 275939328 

[jira] [Commented] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend

2016-08-08 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412219#comment-15412219
 ] 

Gilbert Song commented on MESOS-6004:
-

OK, to answer #3: I just saw it in the Slack channel. Your image contains 
about 55 layers.

>  Tasks fail when provisioning multiple containers with large docker images 
> using copy backend
> -
>
> Key: MESOS-6004
> URL: https://issues.apache.org/jira/browse/MESOS-6004
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.28.2, 1.0.0
> Environment: h4. Agent Platform
> - Ubuntu 16.04
> - AWS g2.x2large instance
> - Nvidia support enabled
> h4. Agent Configuration
> -{noformat}
> --containerizers=mesos,docker
> --docker_config=
> --docker_store_dir=/mnt/mesos/store/docker
> --executor_registration_timeout=3mins
> --hostname=
> --image_providers=docker
> --image_provisioner_backend=copy
> --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia
> --switch_user=false
> --work_dir=/mnt/mesos
> {noformat}
> h4. Framework
> - custom framework written in python
> - using unified containerizer with docker images
> h4. Test Setup
> * 1 master
> * 1 agent
> * 5 tasks scheduled at the same time:
> ** resources: cpus: 0.1, mem: 128
> ** command: `echo test`
> ** docker image: custom docker image, based on nvidia/cuda ~5gb
> ** the same docker image was for all tasks, already pulled.
>Reporter: Michael Thomas
>  Labels: containerizer, docker, performance
>
> When scheduling more than one task on the same agent, all tasks fail, as 
> containers seem to be destroyed during provisioning.
> Specifically, the errors in the agent logs are:
> {noformat}
>  E0808 15:53:09.691315 30996 slave.cpp:3976] Container 
> 'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework 
> c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being 
> destroyed during provisioning
> {noformat}
> and 
> {noformat}
> I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of 
> framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not 
> register within 3mins
> {noformat}
> As the default provisioning method {{copy}} is being used, I assume this is 
> due to the provisioning of multiple containers taking too long, and the agent 
> will not wait. For large images, this method is simply not performant.
> The issue did not occur when only one task was scheduled.
> Increasing the {{executor_registration_timeout}} parameter seemed to help a 
> bit, as it allowed scheduling at least 2 tasks at the same time, but it still 
> fails with more (5 in this case).
> h4. Complete logs
> (with GLOG_v=0, as with 1 it was too long)
> {noformat}
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661067 
> 30961 main.cpp:434] Starting Mesos agent
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661551 
> 30961 slave.cpp:198] Agent started on 1)@172.31.23.17:5051
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661578 
> 30961 slave.cpp:199] Flags at startup: 
> --appc_simple_discovery_uri_prefix="http://" 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos,docker" --default_role="*" 
> --disk_watch_interval="1mins" --docker="docker" 
> --docker_config="{"auths":{"https:\/\/index.docker.io\/v1\/":{"auth":"dGVycmFsb3VwZTpUYWxFWUFOSXR5","email":"sebastian.ge...@terraloupe.com"}}}"
>  --docker_kill_orphans="true" 
> --docker_registry="https://registry-1.docker.io; --docker_remove_delay="6hrs" 
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
> --docker_store_dir="/mnt/mesos/store/docker" --do
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: 
> cker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_registration_timeout="3mins" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" 
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" 
> --hadoop_home="" --help="false" 
> --hostname="ec2-52-59-113-0.eu-central-1.compute.amazonaws.com" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_command_executor="false" --image_providers="docker" 
> --image_provisioner_backend="copy" 

[jira] [Commented] (MESOS-5986) SSL Socket CHECK can fail after socket receives EOF

2016-08-08 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412203#comment-15412203
 ] 

Greg Mann commented on MESOS-5986:
--

{code}
commit f5822f3c13f4fdacbb390341940d3379248a9837
Author: Greg Mann g...@mesosphere.io
Date:   Fri Aug 5 18:19:33 2016 -0700

Removed incorrect CHECK in SSL socket `send()`.

The lambda placed on the event loop by the libevent SSL
socket's `send()` method previously used a `CHECK` to
ensure that the socket's `send_request` member was not
`nullptr`. This patch removes this check, since
`send_request` may become `nullptr` any time the socket
receives an EOF or ERROR event.

Note that the current handling of events is incorrect
also, but we do not attempt a fix here. To be specific,
reading EOF should not deal with send requests at all
(see MESOS-5999). Also, the ERROR events are not
differentiated between reading and writing. Lastly,
when we receive an EOF we do not ensure that the caller
can read the bytes that remain in the buffer!

Review: https://reviews.apache.org/r/50741/
{code}
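
A minimal, self-contained sketch of the pattern the commit message describes (assumed names, not the libprocess code): the event-loop callback tolerates a discarded request instead of CHECK-failing on it.

{code}
#include <memory>

struct SendRequest { /* pending write state */ };

struct SocketData
{
  std::shared_ptr<SendRequest> send_request;

  void onWritable()
  {
    // An EOF or ERROR event may have already cleared 'send_request', so a
    // CHECK(send_request != nullptr) here could abort the process. Guard
    // instead and simply return if there is nothing left to send.
    if (send_request == nullptr) {
      return;
    }
    // ... perform the write described by *send_request ...
  }
};
{code}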

> SSL Socket CHECK can fail after socket receives EOF
> ---
>
> Key: MESOS-5986
> URL: https://issues.apache.org/jira/browse/MESOS-5986
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Blocker
>  Labels: mesosphere
>
> While writing a test for MESOS-3753, I encountered a bug where [this 
> check|https://github.com/apache/mesos/blob/853821cafcca3550b9c7bdaba5262d73869e2ee1/3rdparty/libprocess/src/libevent_ssl_socket.cpp#L708]
>  fails at the very end of the test body, while objects in the stack frame are 
> being destroyed. After adding some debug logging output, I produced the 
> following:
> {code}
> I0804 08:32:33.263211 273793024 libevent_ssl_socket.cpp:681] *** in send()17
> I0804 08:32:33.263209 273256448 process.cpp:2970] Cleaning up 
> __limiter__(3)@127.0.0.1:55688
> I0804 08:32:33.263263 275939328 libevent_ssl_socket.cpp:152] *** in 
> initialize(): 14
> I0804 08:32:33.263206 272719872 process.cpp:2865] Resuming 
> (61)@127.0.0.1:55688 at 2016-08-04 15:32:33.263261952+00:00
> I0804 08:32:33.263327 275939328 libevent_ssl_socket.cpp:584] *** in recv()14
> I0804 08:32:33.263337 272719872 hierarchical.cpp:571] Agent 
> e2a49340-34ec-403f-a5a4-15e29c4a2434-S0 deactivated
> I0804 08:32:33.263322 275402752 process.cpp:2865] Resuming 
> help@127.0.0.1:55688 at 2016-08-04 15:32:33.263343104+00:00
> I0804 08:32:33.263510 275939328 libevent_ssl_socket.cpp:322] *** in 
> event_callback(bev)
> I0804 08:32:33.263536 275939328 libevent_ssl_socket.cpp:353] *** in 
> event_callback check for EOF/CONNECTED/ERROR: 19
> I0804 08:32:33.263592 275939328 libevent_ssl_socket.cpp:159] *** in 
> shutdown(): 19
> I0804 08:32:33.263622 1985901312 process.cpp:3170] Donating thread to 
> (87)@127.0.0.1:55688 while waiting
> I0804 08:32:33.263639 274329600 process.cpp:2865] Resuming 
> __http__(12)@127.0.0.1:55688 at 2016-08-04 15:32:33.263653888+00:00
> I0804 08:32:33.263659 1985901312 process.cpp:2865] Resuming 
> (87)@127.0.0.1:55688 at 2016-08-04 15:32:33.263671040+00:00
> I0804 08:32:33.263730 1985901312 process.cpp:2970] Cleaning up 
> (87)@127.0.0.1:55688
> I0804 08:32:33.263741 275939328 libevent_ssl_socket.cpp:322] *** in 
> event_callback(bev)
> I0804 08:32:33.263736 274329600 process.cpp:2970] Cleaning up 
> __http__(12)@127.0.0.1:55688
> I0804 08:32:33.263778 275939328 libevent_ssl_socket.cpp:353] *** in 
> event_callback check for EOF/CONNECTED/ERROR: 17
> I0804 08:32:33.263818 275939328 libevent_ssl_socket.cpp:159] *** in 
> shutdown(): 17
> I0804 08:32:33.263839 272183296 process.cpp:2865] Resuming 
> help@127.0.0.1:55688 at 2016-08-04 15:32:33.263857920+00:00
> I0804 08:32:33.263933 273793024 process.cpp:2865] Resuming 
> __gc__@127.0.0.1:55688 at 2016-08-04 15:32:33.263951104+00:00
> I0804 08:32:33.264034 275939328 libevent_ssl_socket.cpp:681] *** in send()17
> I0804 08:32:33.264020 272719872 process.cpp:2865] Resuming 
> __http__(11)@127.0.0.1:55688 at 2016-08-04 15:32:33.264041984+00:00
> I0804 08:32:33.264036 274329600 process.cpp:2865] Resuming 
> status-update-manager(3)@127.0.0.1:55688 at 2016-08-04 
> 15:32:33.264056064+00:00
> I0804 08:32:33.264071 272719872 process.cpp:2970] Cleaning up 
> __http__(11)@127.0.0.1:55688
> I0804 08:32:33.264088 274329600 process.cpp:2970] Cleaning up 
> status-update-manager(3)@127.0.0.1:55688
> I0804 08:32:33.264086 275939328 libevent_ssl_socket.cpp:721] *** sending on 
> socket: 17, data: 0
> I0804 08:32:33.264112 272183296 process.cpp:2865] Resuming 
> (89)@127.0.0.1:55688 at 2016-08-04 15:32:33.264126976+00:00
> I0804 08:32:33.264118 275402752 process.cpp:2865] Resuming 
> help@127.0.0.1:55688 at 2016-08-04 15:32:33.264144896+00:00
> I0804 

[jira] [Updated] (MESOS-5986) SSL Socket CHECK can fail after socket receives EOF

2016-08-08 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-5986:
-
Fix Version/s: (was: 1.0.1)

> SSL Socket CHECK can fail after socket receives EOF
> ---
>
> Key: MESOS-5986
> URL: https://issues.apache.org/jira/browse/MESOS-5986
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Blocker
>  Labels: mesosphere
>
> While writing a test for MESOS-3753, I encountered a bug where [this 
> check|https://github.com/apache/mesos/blob/853821cafcca3550b9c7bdaba5262d73869e2ee1/3rdparty/libprocess/src/libevent_ssl_socket.cpp#L708]
>  fails at the very end of the test body, while objects in the stack frame are 
> being destroyed. After adding some debug logging output, I produced the 
> following:
> {code}
> I0804 08:32:33.263211 273793024 libevent_ssl_socket.cpp:681] *** in send()17
> I0804 08:32:33.263209 273256448 process.cpp:2970] Cleaning up 
> __limiter__(3)@127.0.0.1:55688
> I0804 08:32:33.263263 275939328 libevent_ssl_socket.cpp:152] *** in 
> initialize(): 14
> I0804 08:32:33.263206 272719872 process.cpp:2865] Resuming 
> (61)@127.0.0.1:55688 at 2016-08-04 15:32:33.263261952+00:00
> I0804 08:32:33.263327 275939328 libevent_ssl_socket.cpp:584] *** in recv()14
> I0804 08:32:33.263337 272719872 hierarchical.cpp:571] Agent 
> e2a49340-34ec-403f-a5a4-15e29c4a2434-S0 deactivated
> I0804 08:32:33.263322 275402752 process.cpp:2865] Resuming 
> help@127.0.0.1:55688 at 2016-08-04 15:32:33.263343104+00:00
> I0804 08:32:33.263510 275939328 libevent_ssl_socket.cpp:322] *** in 
> event_callback(bev)
> I0804 08:32:33.263536 275939328 libevent_ssl_socket.cpp:353] *** in 
> event_callback check for EOF/CONNECTED/ERROR: 19
> I0804 08:32:33.263592 275939328 libevent_ssl_socket.cpp:159] *** in 
> shutdown(): 19
> I0804 08:32:33.263622 1985901312 process.cpp:3170] Donating thread to 
> (87)@127.0.0.1:55688 while waiting
> I0804 08:32:33.263639 274329600 process.cpp:2865] Resuming 
> __http__(12)@127.0.0.1:55688 at 2016-08-04 15:32:33.263653888+00:00
> I0804 08:32:33.263659 1985901312 process.cpp:2865] Resuming 
> (87)@127.0.0.1:55688 at 2016-08-04 15:32:33.263671040+00:00
> I0804 08:32:33.263730 1985901312 process.cpp:2970] Cleaning up 
> (87)@127.0.0.1:55688
> I0804 08:32:33.263741 275939328 libevent_ssl_socket.cpp:322] *** in 
> event_callback(bev)
> I0804 08:32:33.263736 274329600 process.cpp:2970] Cleaning up 
> __http__(12)@127.0.0.1:55688
> I0804 08:32:33.263778 275939328 libevent_ssl_socket.cpp:353] *** in 
> event_callback check for EOF/CONNECTED/ERROR: 17
> I0804 08:32:33.263818 275939328 libevent_ssl_socket.cpp:159] *** in 
> shutdown(): 17
> I0804 08:32:33.263839 272183296 process.cpp:2865] Resuming 
> help@127.0.0.1:55688 at 2016-08-04 15:32:33.263857920+00:00
> I0804 08:32:33.263933 273793024 process.cpp:2865] Resuming 
> __gc__@127.0.0.1:55688 at 2016-08-04 15:32:33.263951104+00:00
> I0804 08:32:33.264034 275939328 libevent_ssl_socket.cpp:681] *** in send()17
> I0804 08:32:33.264020 272719872 process.cpp:2865] Resuming 
> __http__(11)@127.0.0.1:55688 at 2016-08-04 15:32:33.264041984+00:00
> I0804 08:32:33.264036 274329600 process.cpp:2865] Resuming 
> status-update-manager(3)@127.0.0.1:55688 at 2016-08-04 
> 15:32:33.264056064+00:00
> I0804 08:32:33.264071 272719872 process.cpp:2970] Cleaning up 
> __http__(11)@127.0.0.1:55688
> I0804 08:32:33.264088 274329600 process.cpp:2970] Cleaning up 
> status-update-manager(3)@127.0.0.1:55688
> I0804 08:32:33.264086 275939328 libevent_ssl_socket.cpp:721] *** sending on 
> socket: 17, data: 0
> I0804 08:32:33.264112 272183296 process.cpp:2865] Resuming 
> (89)@127.0.0.1:55688 at 2016-08-04 15:32:33.264126976+00:00
> I0804 08:32:33.264118 275402752 process.cpp:2865] Resuming 
> help@127.0.0.1:55688 at 2016-08-04 15:32:33.264144896+00:00
> I0804 08:32:33.264149 272183296 process.cpp:2970] Cleaning up 
> (89)@127.0.0.1:55688
> I0804 08:32:33.264202 275939328 libevent_ssl_socket.cpp:281] *** in 
> send_callback(bev)
> I0804 08:32:33.264400 273793024 process.cpp:3170] Donating thread to 
> (86)@127.0.0.1:55688 while waiting
> I0804 08:32:33.264413 273256448 process.cpp:2865] Resuming 
> (76)@127.0.0.1:55688 at 2016-08-04 15:32:33.264428032+00:00
> I0804 08:32:33.296268 275939328 libevent_ssl_socket.cpp:300] *** in 
> send_callback(): 17
> I0804 08:32:33.296419 273256448 process.cpp:2970] Cleaning up 
> (76)@127.0.0.1:55688
> I0804 08:32:33.296357 273793024 process.cpp:2865] Resuming 
> (86)@127.0.0.1:55688 at 2016-08-04 15:32:33.296414976+00:00
> I0804 08:32:33.296464 273793024 process.cpp:2970] Cleaning up 
> (86)@127.0.0.1:55688
> I0804 08:32:33.296497 275939328 libevent_ssl_socket.cpp:104] *** releasing 
> SSL socket
> I0804 

[jira] [Commented] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend

2016-08-08 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412187#comment-15412187
 ] 

Gilbert Song commented on MESOS-6004:
-

3. Please also attach the approximate number of image layers. Appreciated!
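
For reference, one way to get an approximate layer count for an already-pulled image is via the Docker CLI (the image name below is a placeholder; every non-header line of {{docker history}} corresponds to one layer):

{noformat}
docker history <your-image:tag> | tail -n +2 | wc -l
{noformat}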

>  Tasks fail when provisioning multiple containers with large docker images 
> using copy backend
> -
>
> Key: MESOS-6004
> URL: https://issues.apache.org/jira/browse/MESOS-6004
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.28.2, 1.0.0
> Environment: h4. Agent Platform
> - Ubuntu 16.04
> - AWS g2.x2large instance
> - Nvidia support enabled
> h4. Agent Configuration
> -{noformat}
> --containerizers=mesos,docker
> --docker_config=
> --docker_store_dir=/mnt/mesos/store/docker
> --executor_registration_timeout=3mins
> --hostname=
> --image_providers=docker
> --image_provisioner_backend=copy
> --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia
> --switch_user=false
> --work_dir=/mnt/mesos
> {noformat}
> h4. Framework
> - custom framework written in python
> - using unified containerizer with docker images
> h4. Test Setup
> * 1 master
> * 1 agent
> * 5 tasks scheduled at the same time:
> ** resources: cpus: 0.1, mem: 128
> ** command: `echo test`
> ** docker image: custom docker image, based on nvidia/cuda ~5gb
> ** the same docker image was for all tasks, already pulled.
>Reporter: Michael Thomas
>  Labels: containerizer, docker, performance
>
> When scheduling more than one task on the same agent, all tasks fail, as 
> containers seem to be destroyed during provisioning.
> Specifically, the errors on the agent logs are:
> {noformat}
>  E0808 15:53:09.691315 30996 slave.cpp:3976] Container 
> 'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework 
> c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being 
> destroyed during provisioning
> {noformat}
> and 
> {noformat}
> I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of 
> framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not 
> register within 3mins
> {noformat}
> As the default provisioning method {{copy}} is being used, I assume this is 
> due to the provisioning of multiple containers taking too long and the agent 
> not waiting. For large images, this method is simply not performant.
> The issue did not occur when only one task was scheduled.
> Increasing the {{executor_registration_timeout}} parameter seemed to help a 
> bit, as it allowed scheduling at least 2 tasks at the same time, but it still 
> fails with more (5 in this case).
> h4. Complete logs
> (with GLOG_v=0, as with 1 it was too long)
> {noformat}
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661067 
> 30961 main.cpp:434] Starting Mesos agent
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661551 
> 30961 slave.cpp:198] Agent started on 1)@172.31.23.17:5051
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661578 
> 30961 slave.cpp:199] Flags at startup: 
> --appc_simple_discovery_uri_prefix="http://; 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos,docker" --default_role="*" 
> --disk_watch_interval="1mins" --docker="docker" 
> --docker_config="{"auths":{"https:\/\/index.docker.io\/v1\/":{"auth":"dGVycmFsb3VwZTpUYWxFWUFOSXR5","email":"sebastian.ge...@terraloupe.com"}}}"
>  --docker_kill_orphans="true" 
> --docker_registry="https://registry-1.docker.io; --docker_remove_delay="6hrs" 
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
> --docker_store_dir="/mnt/mesos/store/docker" --do
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: 
> cker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_registration_timeout="3mins" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" 
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" 
> --hadoop_home="" --help="false" 
> --hostname="ec2-52-59-113-0.eu-central-1.compute.amazonaws.com" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_command_executor="false" --image_providers="docker" 
> --image_provisioner_backend="copy" --initialize_driver_logging="true" 
> 

[jira] [Comment Edited] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend

2016-08-08 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412159#comment-15412159
 ] 

Gilbert Song edited comment on MESOS-6004 at 8/8/16 5:56 PM:
-

Thanks [~mito]. We need to fix this issue. Most likely this is because the 
image size is too large and it takes time to download/copy. Could you please:

1. Just out of curiosity, could you test using the local puller and the overlay 
backend (--docker_registry=/path/to/your/image/tarballs/folder and 
--image_provisioner_backend=overlay)? We want to know whether you still see the 
scheduling issue.

2. Attach the GLOG_v=1 log; it should be fine in size if you wrap it in 
`noformat`.


was (Author: gilbert):
Thanks [~mito]. We need to fix this issue. Most likely this is because the 
image size is too large and it takes time to download/copy. Could you please:

1. Just out of curiosity, could you test using the local puller and the overlay 
backend (`--docker_registry=/path/to/your/image/tarballs/folder` and 
`--image_provisioner_backend=overlay`)? We want to know whether you still see 
the scheduling issue.

2. Attach the GLOG_v=1 log; it should be fine in size if you wrap it in 
`noformat`.

>  Tasks fail when provisioning multiple containers with large docker images 
> using copy backend
> -
>
> Key: MESOS-6004
> URL: https://issues.apache.org/jira/browse/MESOS-6004
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.28.2, 1.0.0
> Environment: h4. Agent Platform
> - Ubuntu 16.04
> - AWS g2.x2large instance
> - Nvidia support enabled
> h4. Agent Configuration
> -{noformat}
> --containerizers=mesos,docker
> --docker_config=
> --docker_store_dir=/mnt/mesos/store/docker
> --executor_registration_timeout=3mins
> --hostname=
> --image_providers=docker
> --image_provisioner_backend=copy
> --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia
> --switch_user=false
> --work_dir=/mnt/mesos
> {noformat}
> h4. Framework
> - custom framework written in python
> - using unified containerizer with docker images
> h4. Test Setup
> * 1 master
> * 1 agent
> * 5 tasks scheduled at the same time:
> ** resources: cpus: 0.1, mem: 128
> ** command: `echo test`
> ** docker image: custom docker image, based on nvidia/cuda ~5gb
> ** the same docker image was for all tasks, already pulled.
>Reporter: Michael Thomas
>  Labels: containerizer, docker, performance
>
> When scheduling more than one task on the same agent, all tasks fail, as 
> containers seem to be destroyed during provisioning.
> Specifically, the errors on the agent logs are:
> {noformat}
>  E0808 15:53:09.691315 30996 slave.cpp:3976] Container 
> 'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework 
> c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being 
> destroyed during provisioning
> {noformat}
> and 
> {noformat}
> I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of 
> framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not 
> register within 3mins
> {noformat}
> As the default provisioning method {{copy}} is being used, I assume this is 
> due to the provisioning of multiple containers taking too long and the agent 
> not waiting. For large images, this method is simply not performant.
> The issue did not occur when only one task was scheduled.
> Increasing the {{executor_registration_timeout}} parameter seemed to help a 
> bit, as it allowed scheduling at least 2 tasks at the same time, but it still 
> fails with more (5 in this case).
> h4. Complete logs
> (with GLOG_v=0, as with 1 it was too long)
> {noformat}
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661067 
> 30961 main.cpp:434] Starting Mesos agent
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661551 
> 30961 slave.cpp:198] Agent started on 1)@172.31.23.17:5051
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661578 
> 30961 slave.cpp:199] Flags at startup: 
> --appc_simple_discovery_uri_prefix="http://; 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos,docker" --default_role="*" 
> --disk_watch_interval="1mins" --docker="docker" 
> --docker_config="{"auths":{"https:\/\/index.docker.io\/v1\/":{"auth":"dGVycmFsb3VwZTpUYWxFWUFOSXR5","email":"sebastian.ge...@terraloupe.com"}}}"
>  

[jira] [Updated] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend

2016-08-08 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-6004:

Affects Version/s: 0.28.2

>  Tasks fail when provisioning multiple containers with large docker images 
> using copy backend
> -
>
> Key: MESOS-6004
> URL: https://issues.apache.org/jira/browse/MESOS-6004
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.28.2, 1.0.0
> Environment: h4. Agent Platform
> - Ubuntu 16.04
> - AWS g2.x2large instance
> - Nvidia support enabled
> h4. Agent Configuration
> -{noformat}
> --containerizers=mesos,docker
> --docker_config=
> --docker_store_dir=/mnt/mesos/store/docker
> --executor_registration_timeout=3mins
> --hostname=
> --image_providers=docker
> --image_provisioner_backend=copy
> --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia
> --switch_user=false
> --work_dir=/mnt/mesos
> {noformat}
> h4. Framework
> - custom framework written in python
> - using unified containerizer with docker images
> h4. Test Setup
> * 1 master
> * 1 agent
> * 5 tasks scheduled at the same time:
> ** resources: cpus: 0.1, mem: 128
> ** command: `echo test`
> ** docker image: custom docker image, based on nvidia/cuda ~5gb
> ** the same docker image was for all tasks, already pulled.
>Reporter: Michael Thomas
>  Labels: containerizer, docker, performance
>
> When scheduling more than one task on the same agent, all tasks fail, as 
> containers seem to be destroyed during provisioning.
> Specifically, the errors on the agent logs are:
> {noformat}
>  E0808 15:53:09.691315 30996 slave.cpp:3976] Container 
> 'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework 
> c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being 
> destroyed during provisioning
> {noformat}
> and 
> {noformat}
> I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of 
> framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not 
> register within 3mins
> {noformat}
> As the default provisioning method {{copy}} is being used, I assume this is 
> due to the provisioning of multiple containers taking too long and the agent 
> not waiting. For large images, this method is simply not performant.
> The issue did not occur when only one task was scheduled.
> Increasing the {{executor_registration_timeout}} parameter seemed to help a 
> bit, as it allowed scheduling at least 2 tasks at the same time, but it still 
> fails with more (5 in this case).
> h4. Complete logs
> (with GLOG_v=0, as with 1 it was too long)
> {noformat}
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661067 
> 30961 main.cpp:434] Starting Mesos agent
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661551 
> 30961 slave.cpp:198] Agent started on 1)@172.31.23.17:5051
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661578 
> 30961 slave.cpp:199] Flags at startup: 
> --appc_simple_discovery_uri_prefix="http://; 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos,docker" --default_role="*" 
> --disk_watch_interval="1mins" --docker="docker" 
> --docker_config="{"auths":{"https:\/\/index.docker.io\/v1\/":{"auth":"dGVycmFsb3VwZTpUYWxFWUFOSXR5","email":"sebastian.ge...@terraloupe.com"}}}"
>  --docker_kill_orphans="true" 
> --docker_registry="https://registry-1.docker.io; --docker_remove_delay="6hrs" 
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
> --docker_store_dir="/mnt/mesos/store/docker" --do
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: 
> cker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_registration_timeout="3mins" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" 
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" 
> --hadoop_home="" --help="false" 
> --hostname="ec2-52-59-113-0.eu-central-1.compute.amazonaws.com" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_command_executor="false" --image_providers="docker" 
> --image_provisioner_backend="copy" --initialize_driver_logging="true" 
> --isolation="filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia" 
> 

[jira] [Commented] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend

2016-08-08 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412159#comment-15412159
 ] 

Gilbert Song commented on MESOS-6004:
-

Thanks [~mito]. We need to fix this issue. Most likely this is because the 
image size is too large and it takes time to download/copy. Could you please:

1. Just out of curiosity, could you test using the local puller and the overlay 
backend (`--docker_registry=/path/to/your/image/tarballs/folder` and 
`--image_provisioner_backend=overlay`)? We want to know whether you still see 
the scheduling issue; an example configuration is sketched below.

2. Attach the GLOG_v=1 log; it should be fine in size if you wrap it in 
`noformat`.
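
For example, the suggested test configuration on the agent might look like the following (the tarball directory is a placeholder, and the remaining agent flags stay as in the original setup):

{noformat}
--docker_registry=/path/to/your/image/tarballs/folder
--image_providers=docker
--image_provisioner_backend=overlay
{noformat}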

>  Tasks fail when provisioning multiple containers with large docker images 
> using copy backend
> -
>
> Key: MESOS-6004
> URL: https://issues.apache.org/jira/browse/MESOS-6004
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.0
> Environment: h4. Agent Platform
> - Ubuntu 16.04
> - AWS g2.x2large instance
> - Nvidia support enabled
> h4. Agent Configuration
> -{noformat}
> --containerizers=mesos,docker
> --docker_config=
> --docker_store_dir=/mnt/mesos/store/docker
> --executor_registration_timeout=3mins
> --hostname=
> --image_providers=docker
> --image_provisioner_backend=copy
> --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia
> --switch_user=false
> --work_dir=/mnt/mesos
> {noformat}
> h4. Framework
> - custom framework written in python
> - using unified containerizer with docker images
> h4. Test Setup
> * 1 master
> * 1 agent
> * 5 tasks scheduled at the same time:
> ** resources: cpus: 0.1, mem: 128
> ** command: `echo test`
> ** docker image: custom docker image, based on nvidia/cuda ~5gb
> ** the same docker image was for all tasks, already pulled.
>Reporter: Michael Thomas
>  Labels: containerizer, docker, performance
>
> When scheduling more than one task on the same agent, all tasks fail, as 
> containers seem to be destroyed during provisioning.
> Specifically, the errors on the agent logs are:
> {noformat}
>  E0808 15:53:09.691315 30996 slave.cpp:3976] Container 
> 'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework 
> c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being 
> destroyed during provisioning
> {noformat}
> and 
> {noformat}
> I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of 
> framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not 
> register within 3mins
> {noformat}
> As the default provisioning method {{copy}} is being used, I assume this is 
> due to the provisioning of multiple containers taking too long and the agent 
> not waiting. For large images, this method is simply not performant.
> The issue did not occur when only one task was scheduled.
> Increasing the {{executor_registration_timeout}} parameter seemed to help a 
> bit, as it allowed scheduling at least 2 tasks at the same time, but it still 
> fails with more (5 in this case).
> h4. Complete logs
> (with GLOG_v=0, as with 1 it was too long)
> {noformat}
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661067 
> 30961 main.cpp:434] Starting Mesos agent
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661551 
> 30961 slave.cpp:198] Agent started on 1)@172.31.23.17:5051
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661578 
> 30961 slave.cpp:199] Flags at startup: 
> --appc_simple_discovery_uri_prefix="http://; 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos,docker" --default_role="*" 
> --disk_watch_interval="1mins" --docker="docker" 
> --docker_config="{"auths":{"https:\/\/index.docker.io\/v1\/":{"auth":"dGVycmFsb3VwZTpUYWxFWUFOSXR5","email":"sebastian.ge...@terraloupe.com"}}}"
>  --docker_kill_orphans="true" 
> --docker_registry="https://registry-1.docker.io; --docker_remove_delay="6hrs" 
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
> --docker_store_dir="/mnt/mesos/store/docker" --do
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: 
> cker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_registration_timeout="3mins" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" 
> 

[jira] [Updated] (MESOS-6003) Add logging module for logging to an external program

2016-08-08 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-6003:
-
Shepherd: Joseph Wu

> Add logging module for logging to an external program
> -
>
> Key: MESOS-6003
> URL: https://issues.apache.org/jira/browse/MESOS-6003
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Will Rouesnel
>Assignee: Will Rouesnel
>Priority: Minor
>
> In the vein of the logrotate module for logging, there should be a similar 
> module which provides support for logging to an arbitrary log handling 
> program, with suitable task metadata provided by environment variables or 
> command line arguments.
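
As a rough illustration only (not a proposed interface; the environment variable names are invented for the example), such an external log handler could be as simple as a program that reads the executor's output on stdin and prefixes each line with task metadata taken from its environment:

{code}
#include <cstdlib>
#include <iostream>
#include <string>

// Reads log lines on stdin and writes them to stdout, prefixed with task
// metadata from the environment. EXAMPLE_TASK_ID and EXAMPLE_FRAMEWORK_ID are
// placeholder variable names, not anything the module actually sets.
int main() {
  const char* task = std::getenv("EXAMPLE_TASK_ID");
  const char* framework = std::getenv("EXAMPLE_FRAMEWORK_ID");

  const std::string prefix =
    std::string("[") + (framework != nullptr ? framework : "unknown-framework") +
    "/" + (task != nullptr ? task : "unknown-task") + "] ";

  std::string line;
  while (std::getline(std::cin, line)) {
    std::cout << prefix << line << std::endl;  // std::endl flushes per line.
  }

  return 0;
}
{code}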



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6005) Support docker registry running non-https on localhost:

2016-08-08 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412136#comment-15412136
 ] 

Gilbert Song commented on MESOS-6005:
-

Thanks [~zhitao], we will address it.

> Support docker registry running non-https on localhost:
> 
>
> Key: MESOS-6005
> URL: https://issues.apache.org/jira/browse/MESOS-6005
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Zhitao Li
>
> (Please update the title with whatever this ends up being.)
> The Docker daemon by default does not use https if the registry host is 
> localhost/127.0.0.1, which is what many people use in dev testing and the like.
> Right now image fetching only supports plain http if the port is 80. Ideally 
> this should be configurable.
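
A minimal sketch of the requested behavior (purely illustrative; this is not the actual fetcher code, and a real fix might instead add an explicit flag): fall back to plain HTTP when the registry host is local, while keeping today's port-80 behavior.

{code}
#include <iostream>
#include <string>

// Illustrative only: choose a URI scheme for a registry endpoint, treating
// localhost/127.0.0.1 the way the Docker daemon does (no TLS), while keeping
// the existing "plain HTTP on port 80" behavior.
std::string registryScheme(const std::string& host, int port) {
  const bool local = (host == "localhost" || host == "127.0.0.1");
  return (local || port == 80) ? "http" : "https";
}

int main() {
  std::cout << registryScheme("localhost", 5000) << "\n";            // http
  std::cout << registryScheme("registry-1.docker.io", 443) << "\n";  // https
  return 0;
}
{code}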



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4577) libprocess can not run on 16-byte aligned stack mandatory architecture(aarch64)

2016-08-08 Thread gtin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412091#comment-15412091
 ] 

gtin commented on MESOS-4577:
-

I got it to work temporarily until there is a mainline kernel for odroid c2 
that supports this. The temporary change I made was in the file
mesos/3rdparty/stout/include/stout/os/linux.hpp: I changed the stack type from 
unsigned long long to long double to provide 16-byte alignment.

  long double* stack =
    new long double[stackSize / sizeof(long double)];

  pid_t pid = ::clone(
      childMain,
      &stack[stackSize / sizeof(stack[0]) - 1],  // stack grows down.
      flags,
      (void*) &func);
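
For comparison, here is a small self-contained sketch (not a patch against stout) of requesting a 16-byte-aligned child stack explicitly with posix_memalign instead of relying on the element type's natural alignment; whether this is the right approach for Mesos is a separate question.

{code}
#include <stdlib.h>

#include <cstdint>
#include <iostream>

int main() {
  const size_t stackSize = 8 * 1024 * 1024;

  // Ask the allocator for 16-byte-aligned memory explicitly.
  void* stack = nullptr;
  if (posix_memalign(&stack, 16, stackSize) != 0) {
    std::cerr << "allocation failed\n";
    return 1;
  }

  // The child stack grows down, so the address passed to ::clone would be the
  // end of this buffer; with a 16-byte-aligned base and a size that is a
  // multiple of 16, the end is 16-byte aligned as well.
  void* stackTop = static_cast<char*>(stack) + stackSize;

  std::cout << "stack top alignment mod 16: "
            << (reinterpret_cast<std::uintptr_t>(stackTop) % 16) << "\n";

  free(stack);
  return 0;
}
{code}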

> libprocess can not run on 16-byte aligned stack mandatory 
> architecture(aarch64) 
> 
>
> Key: MESOS-4577
> URL: https://issues.apache.org/jira/browse/MESOS-4577
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, stout
> Environment: Linux 10-175-112-202 4.1.6-rc3.aarch64 #1 SMP Mon Oct 12 
> 01:43:03 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux
>Reporter: AndyPang
>Assignee: AndyPang
>  Labels: mesosphere
>
> mesos run in AArch64 will get error, the log is:
> {code}
> E0101 00:06:56.636520 32411 slave.cpp:3342] Container 
> 'b6be429a-08f0-4d52-b01d-abfcb6e0106b' for executor 
> 'hello.84d205ae-f626-11de-bd66-7a3f6cf980b9' of framework 
> '868b9f04-9179-427b-b050-ee8f89ffa3bd-' failed to start: Failed to fork 
> executor: Failed to clone child process: Failed to clone: Invalid argument 
> {code}
> the "clone" implementation in the libprocess 3rdparty stout library (in 
> linux.hpp) wraps the "clone" syscall:
> {code:title=clone|borderStyle=solid}
> inline pid_t clone(const lambda::function<int()>& func, int flags)
> {
>   // Stack for the child.
>   // - unsigned long long used for best alignment.
>   // - 8 MiB appears to be the default for "ulimit -s" on OSX and Linux.
>   //
>   // NOTE: We need to allocate the stack dynamically. This is because
>   // glibc's 'clone' will modify the stack passed to it, therefore the
>   // stack must NOT be shared as multiple 'clone's can be invoked
>   // simultaneously.
>   int stackSize = 8 * 1024 * 1024;
>   unsigned long long *stack =
>     new unsigned long long[stackSize/sizeof(unsigned long long)];
>   pid_t pid = ::clone(
>       childMain,
>       &stack[stackSize/sizeof(stack[0]) - 1],  // stack grows down.
>       flags,
>       (void*) &func);
>   // If CLONE_VM is not set, ::clone would create a process which runs in a
>   // separate copy of the memory space of the calling process. So we destroy
>   // the stack here to avoid a memory leak. If CLONE_VM is set, ::clone would
>   // create a thread which runs in the same memory space as the calling process.
>   if (!(flags & CLONE_VM)) {
>     delete[] stack;
>   }
>   return pid;
> }
> {code}
> The "clone" syscall's stack parameter is only 8-byte aligned here, so on an 
> architecture that mandates 16-byte stack alignment (aarch64) it fails with an 
> error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend

2016-08-08 Thread Michael Thomas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Thomas updated MESOS-6004:
--
Description: 
When scheduling more than one task on the same agent, all tasks fail, as 
containers seem to be destroyed during provisioning.

Specifically, the errors on the agent logs are:

{noformat}
 E0808 15:53:09.691315 30996 slave.cpp:3976] Container 
'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework 
c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being 
destroyed during provisioning
{noformat}

and 

{noformat}
I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of 
framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not 
register within 3mins
{noformat}

As the default provisioning method {{copy}} is being used, I assume this is due 
to the provisioning of multiple containers taking too long and the agent not 
waiting. For large images, this method is simply not performant.

The issue did not occur when only one task was scheduled.

Increasing the {{executor_registration_timeout}} parameter seemed to help a 
bit, as it allowed scheduling at least 2 tasks at the same time, but it still 
fails with more (5 in this case).



h4. Complete logs

(with GLOG_v=0, as with 1 it was too long)

{noformat}
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661067 30961 
main.cpp:434] Starting Mesos agent
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661551 30961 
slave.cpp:198] Agent started on 1)@172.31.23.17:5051
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661578 30961 
slave.cpp:199] Flags at startup: --appc_simple_discovery_uri_prefix="http://; 
--appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
--authenticate_http_readwrite="false" --authenticatee="crammd5" 
--authentication_backoff_factor="1secs" --authorizer="local" 
--cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
--cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
--cgroups_root="mesos" --container_disk_watch_interval="15secs" 
--containerizers="mesos,docker" --default_role="*" 
--disk_watch_interval="1mins" --docker="docker" 
--docker_config="{"auths":{"https:\/\/index.docker.io\/v1\/":{"auth":"dGVycmFsb3VwZTpUYWxFWUFOSXR5","email":"sebastian.ge...@terraloupe.com"}}}"
 --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io; 
--docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" 
--docker_stop_timeout="0ns" --docker_store_dir="/mnt/mesos/store/docker" --do
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: 
cker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
--enforce_container_disk_quota="false" --executor_registration_timeout="3mins" 
--executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" 
--fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" 
--gc_disk_headroom="0.1" --hadoop_home="" --help="false" 
--hostname="ec2-52-59-113-0.eu-central-1.compute.amazonaws.com" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_command_executor="false" --image_providers="docker" 
--image_provisioner_backend="copy" --initialize_driver_logging="true" 
--isolation="filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia" 
--launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" 
--logging_level="INFO" --master="zk://172.31.19.240:2181/mesos" 
--oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
--perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
--quiet="false" --recov
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: er="reconnect" 
--recovery_timeout="15mins" --registration_backoff_factor="1secs" 
--revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" 
--strict="true" --switch_user="false" --systemd_enable_support="true" 
--systemd_runtime_directory="/run/systemd/system" --version="false" 
--work_dir="/mnt/mesos"
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.662147 30961 
slave.cpp:519] Agent resources: gpus(*):1; cpus(*):8; mem(*):14014; 
disk(*):60257; ports(*):[31000-32000]
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.662211 30961 
slave.cpp:527] Agent attributes: [  ]
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.662230 30961 
slave.cpp:532] Agent hostname: 
ec2-52-59-113-0.eu-central-1.compute.amazonaws.com
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.663354 31000 
state.cpp:57] Recovering state from '/mnt/mesos/meta'
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.663918 30995 
status_update_manager.cpp:200] Recovering status update manager
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.664131 30996 
containerizer.cpp:522] Recovering 

[jira] [Created] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend

2016-08-08 Thread Michael Thomas (JIRA)
Michael Thomas created MESOS-6004:
-

 Summary:  Tasks fail when provisioning multiple containers with 
large docker images using copy backend
 Key: MESOS-6004
 URL: https://issues.apache.org/jira/browse/MESOS-6004
 Project: Mesos
  Issue Type: Bug
  Components: containerization, docker
Affects Versions: 1.0.0
 Environment: h4. Agent Platform

- Ubuntu 16.04
- AWS g2.x2large instance
- Nvidia support enabled

h4. Agent Configuration

-{noformat}
--containerizers=mesos,docker
--docker_config=
--docker_store_dir=/mnt/mesos/store/docker
--executor_registration_timeout=3mins
--hostname=
--image_providers=docker
--image_provisioner_backend=copy
--isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia
--switch_user=false
--work_dir=/mnt/mesos
{noformat}

h4. Framework

- custom framework written in python
- using unified containerizer with docker images

h4. Test Setup

* 1 master
* 1 agent
* 5 tasks scheduled at the same time:
** resources: cpus: 0.1, mem: 128
** command: `echo test`
** docker image: custom docker image, based on nvidia/cuda ~5gb
** the same docker image was for all tasks, already pulled.

Reporter: Michael Thomas


When scheduling more than one task on the same agent, all tasks fail, as 
containers seem to be destroyed during provisioning.

Specifically, the errors on the agent logs are:

{noformat}
 E0808 15:53:09.691315 30996 slave.cpp:3976] Container 
'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework 
c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being 
destroyed during provisioning
{noformat}

and 

{noformat}
I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of 
framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not 
register within 3mins
{noformat}

As the default provisioning method `copy` is being used, I assume this is due 
to the provisioning of multiple containers taking too long and the agent not 
waiting. For large images, this method is simply not performant.

The issue did not occur when only one task was scheduled.

Increasing the `executor_registration_timeout` parameter seemed to help a bit, 
as it allowed scheduling at least 2 tasks at the same time, but it still fails 
with more (5 in this case).



h4. Complete logs

(with GLOG_v=0, as with 1 it was too long)

{noformat}
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661067 30961 
main.cpp:434] Starting Mesos agent
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661551 30961 
slave.cpp:198] Agent started on 1)@172.31.23.17:5051
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661578 30961 
slave.cpp:199] Flags at startup: --appc_simple_discovery_uri_prefix="http://; 
--appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
--authenticate_http_readwrite="false" --authenticatee="crammd5" 
--authentication_backoff_factor="1secs" --authorizer="local" 
--cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
--cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
--cgroups_root="mesos" --container_disk_watch_interval="15secs" 
--containerizers="mesos,docker" --default_role="*" 
--disk_watch_interval="1mins" --docker="docker" 
--docker_config="{"auths":{"https:\/\/index.docker.io\/v1\/":{"auth":"dGVycmFsb3VwZTpUYWxFWUFOSXR5","email":"sebastian.ge...@terraloupe.com"}}}"
 --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io; 
--docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" 
--docker_stop_timeout="0ns" --docker_store_dir="/mnt/mesos/store/docker" --do
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: 
cker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
--enforce_container_disk_quota="false" --executor_registration_timeout="3mins" 
--executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" 
--fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" 
--gc_disk_headroom="0.1" --hadoop_home="" --help="false" 
--hostname="ec2-52-59-113-0.eu-central-1.compute.amazonaws.com" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_command_executor="false" --image_providers="docker" 
--image_provisioner_backend="copy" --initialize_driver_logging="true" 
--isolation="filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia" 
--launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" 
--logging_level="INFO" --master="zk://172.31.19.240:2181/mesos" 
--oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
--perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
--quiet="false" --recov
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: er="reconnect" 
--recovery_timeout="15mins" --registration_backoff_factor="1secs" 
--revocable_cpu_low_priority="true" 

[jira] [Updated] (MESOS-6003) Add logging module for logging to an external program

2016-08-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-6003:
--
Assignee: Will Rouesnel

> Add logging module for logging to an external program
> -
>
> Key: MESOS-6003
> URL: https://issues.apache.org/jira/browse/MESOS-6003
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Will Rouesnel
>Assignee: Will Rouesnel
>Priority: Minor
>
> In the vein of the logrotate module for logging, there should be a similar 
> module which provides support for logging to an arbitrary log handling 
> program, with suitable task metadata provided by environment variables or 
> command line arguments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5028) Copy provisioner cannot replace directory with symlink

2016-08-08 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412053#comment-15412053
 ] 

Zhitao Li commented on MESOS-5028:
--

One thing I forgot to mention is that I did a {{docker save}} to a tar file and 
used the local store registry option when performing the test. The problematic 
layer I generated does not have an extra whiteout file in such a case:

{quote}
zhitao@zhitao-mesos1:~/mesos/build$ ls -alR 
/t/layers/90e46350e512b827e8fe73a053ededc13f7eb1bccca96dc8ef86d6a6cd98f29c/rootfs/
/t/layers/90e46350e512b827e8fe73a053ededc13f7eb1bccca96dc8ef86d6a6cd98f29c/rootfs/:
total 12
drwxr-xr-x 3 root root 4096 Aug  8 16:36 .
drwxr-xr-x 3 root root 4096 Aug  8 16:36 ..
drwxrwxr-x 2 root root 4096 Aug  5 20:01 etc

/t/layers/90e46350e512b827e8fe73a053ededc13f7eb1bccca96dc8ef86d6a6cd98f29c/rootfs/etc:
total 8
drwxrwxr-x 2 root root 4096 Aug  5 20:01 .
drwxr-xr-x 3 root root 4096 Aug  8 16:36 ..
lrwxrwxrwx 1 root root4 Aug  5 20:01 cirros -> /tmp
{quote}

> Copy provisioner cannot replace directory with symlink
> --
>
> Key: MESOS-5028
> URL: https://issues.apache.org/jira/browse/MESOS-5028
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Zhitao Li
>Assignee: Gilbert Song
>
> I'm trying to play with the new image provisioner on our custom docker 
> images, but one of layer failed to get copied, possibly due to a dangling 
> symlink.
> Error log with Glog_v=1:
> {quote}
> I0324 05:42:48.926678 15067 copy.cpp:127] Copying layer path 
> '/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs'
>  to rootfs 
> '/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6'
> E0324 05:42:49.028506 15062 slave.cpp:3773] Container 
> '5f05be6c-c970-4539-aa64-fd0eef2ec7ae' for executor 'test' of framework 
> 75932a89-1514-4011-bafe-beb6a208bb2d-0004 failed to start: Collect failed: 
> Collect failed: Failed to copy layer: cp: cannot overwrite directory 
> ‘/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6/etc/apt’
>  with non-directory
> {quote}
> Content of 
> _/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs/etc/apt_
>  points to a non-existing absolute path (cannot provide exact path but it's a 
> result of us trying to mount apt keys into docker container at build time).
> I believe what happened is that we executed a script at build time, which 
> contains equivalent of:
> {quote}
> rm -rf /etc/apt/* && ln -sf /build-mount-point/ /etc/apt
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5830) Make a sweep to trim excess space around angle brackets

2016-08-08 Thread Gaojin CAO (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412052#comment-15412052
 ] 

Gaojin CAO commented on MESOS-5830:
---

Sure, done!

> Make a sweep to trim excess space around angle brackets
> ---
>
> Key: MESOS-5830
> URL: https://issues.apache.org/jira/browse/MESOS-5830
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Bannier
>Priority: Trivial
>  Labels: mesosphere, newbie
>
> The codebase still has pre-C++11 code where we needed to say e.g., 
> {{vector

[jira] [Commented] (MESOS-4577) libprocess can not run on 16-byte aligned stack mandatory architecture(aarch64)

2016-08-08 Thread gtin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15411969#comment-15411969
 ] 

gtin commented on MESOS-4577:
-

It seems this issue was fixed in the latest kernel, 4.7; it does not enforce 
16-byte alignment anymore:
https://github.com/torvalds/linux/blob/v4.7/arch/arm64/kernel/process.c
https://patchwork.codeaurora.org/patch/13893/

It would be nice to have a workaround for those of us stuck on old kernels.

> libprocess can not run on 16-byte aligned stack mandatory 
> architecture(aarch64) 
> 
>
> Key: MESOS-4577
> URL: https://issues.apache.org/jira/browse/MESOS-4577
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, stout
> Environment: Linux 10-175-112-202 4.1.6-rc3.aarch64 #1 SMP Mon Oct 12 
> 01:43:03 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux
>Reporter: AndyPang
>Assignee: AndyPang
>  Labels: mesosphere
>
> mesos run in AArch64 will get error, the log is:
> {code}
> E0101 00:06:56.636520 32411 slave.cpp:3342] Container 
> 'b6be429a-08f0-4d52-b01d-abfcb6e0106b' for executor 
> 'hello.84d205ae-f626-11de-bd66-7a3f6cf980b9' of framework 
> '868b9f04-9179-427b-b050-ee8f89ffa3bd-' failed to start: Failed to fork 
> executor: Failed to clone child process: Failed to clone: Invalid argument 
> {code}
> the "clone" implementation in the libprocess 3rdparty stout library (in 
> linux.hpp) wraps the "clone" syscall:
> {code:title=clone|borderStyle=solid}
> inline pid_t clone(const lambda::function<int()>& func, int flags)
> {
>   // Stack for the child.
>   // - unsigned long long used for best alignment.
>   // - 8 MiB appears to be the default for "ulimit -s" on OSX and Linux.
>   //
>   // NOTE: We need to allocate the stack dynamically. This is because
>   // glibc's 'clone' will modify the stack passed to it, therefore the
>   // stack must NOT be shared as multiple 'clone's can be invoked
>   // simultaneously.
>   int stackSize = 8 * 1024 * 1024;
>   unsigned long long *stack =
>     new unsigned long long[stackSize/sizeof(unsigned long long)];
>   pid_t pid = ::clone(
>       childMain,
>       &stack[stackSize/sizeof(stack[0]) - 1],  // stack grows down.
>       flags,
>       (void*) &func);
>   // If CLONE_VM is not set, ::clone would create a process which runs in a
>   // separate copy of the memory space of the calling process. So we destroy
>   // the stack here to avoid a memory leak. If CLONE_VM is set, ::clone would
>   // create a thread which runs in the same memory space as the calling process.
>   if (!(flags & CLONE_VM)) {
>     delete[] stack;
>   }
>   return pid;
> }
> {code}
> The "clone" syscall's stack parameter is only 8-byte aligned here, so on an 
> architecture that mandates 16-byte stack alignment (aarch64) it fails with an 
> error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5830) Make a sweep to trim excess space around angle brackets

2016-08-08 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-5830:

Labels: mesosphere newbie  (was: )

> Make a sweep to trim excess space around angle brackets
> ---
>
> Key: MESOS-5830
> URL: https://issues.apache.org/jira/browse/MESOS-5830
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Bannier
>Priority: Trivial
>  Labels: mesosphere, newbie
>
> The codebase still has pre-C++11 code where we needed to say e.g., 
> {{vector

[jira] [Commented] (MESOS-5830) Make a sweep to trim excess space around angle brackets

2016-08-08 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15411797#comment-15411797
 ] 

Benjamin Bannier commented on MESOS-5830:
-

[~zerobleed] I see you have already posted a patch 
(https://reviews.apache.org/r/50887/). Could you please first get yourself 
added as a contributor so you could then assign this ticket to yourself? After 
that you could post a link to the review and move this ticket to a reviewable 
state.

> Make a sweep to trim excess space around angle brackets
> ---
>
> Key: MESOS-5830
> URL: https://issues.apache.org/jira/browse/MESOS-5830
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Bannier
>Priority: Trivial
>
> The codebase still has pre-C++11 code where we needed to say e.g., 
> {{vector

[jira] [Commented] (MESOS-5536) Completed executors presented as alive

2016-08-08 Thread Tomasz Janiszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15411757#comment-15411757
 ] 

Tomasz Janiszewski commented on MESOS-5536:
---

After updating to 0.28.2, completed executors still show up. I'll delete them 
manually and monitor whether new ones appear.

> Completed executors presented as alive
> --
>
> Key: MESOS-5536
> URL: https://issues.apache.org/jira/browse/MESOS-5536
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.28.0
> Environment: Ubuntu 14.04.3 LTS
>Reporter: Tomasz Janiszewski
>
> I'm running Mesos 0.28.0. The Mesos {{slave(1)/state}} endpoint returns some 
> completed executors not in frameworks.completed_executors but in 
> frameworks.executors. Also, this executor is present in {{monitor/statistics}}.
> {code:JavaScript:title=slave(1)/state}
> {
> "attributes": {...},
> "completed_frameworks": [],
> "flags": {...},
> "frameworks": [
> {
> "checkpoint": true,
> "completed_executors": [...],
> "executors": [
>   {
>   "queued_tasks": [],
>   "tasks": [],
>   "completed_tasks": [
>   {
>   "discovery": {...},
>   "executor_id": "",
>   "framework_id": 
> "f65b163c-0faf-441f-ac14-91739fa4394c-",
>   "id": 
> "service.a3b609b8-27ec-11e6-8044-02c89eb9127e",
>   "labels": [...],
>   "name": "service",
>   "resources": {...},
>   "slave_id": 
> "ef232fd9-5114-4d8f-adc3-1669c1e6fdc5-S13",
>   "state": "TASK_KILLED",
>   "statuses": []
>   }
>   ],
>   "container": "ead42e63-ac92-4ad0-a99c-4af9c3fa5e31",
>   "directory": "...",
>   "id": "service.a3b609b8-27ec-11e6-8044-02c89eb9127e",
>   "name": "Command Executor (Task: 
> service.a3b609b8-27ec-11e6-8044-02c89eb9127e) (Command: sh -c 'cd 
> service...')",  
>   "resources": {...},
>   "source": "service.a3b609b8-27ec-11e6-8044-02c89eb9127e"
>   
>   },
>   ...
> ],
> }
> ],
> "git_sha": "961edbd82e691a619a4c171a7aadc9c32957fa73",
> "git_tag": "0.28.0",
> "version": "0.28.0",
> ...
> }
> {code}
> {code:title="var/log/mesos/mesos-slave.INFO"}
> 13:33:19.479182  [slave.cpp:1361] Got assigned task 
> service.a3b609b8-27ec-11e6-8044-02c89eb9127e for framework 
> f65b163c-0faf-441f-ac14-91739fa4394c-
> 13:33:19.482566  [slave.cpp:1480] Launching task 
> service.a3b609b8-27ec-11e6-8044-02c89eb9127e for framework 
> f65b163c-0faf-441f-ac14-91739fa4394c-
> 13:33:19.483921  [paths.cpp:528] Trying to chown 
> '/tmp/mesos/slaves/ef232fd9-5114-4d8f-adc3-1669c1e6fdc5-S13/frameworks/f65b163c-0faf-441f-ac14-91739fa4394c-/executors/service.a3b609b8-27ec-11e6-8044-02c89eb9127e/runs/ead42e63-ac92-4ad0-a99c-4af9c3fa5e31'
>  to user 'mesosuser'
> 13:33:19.504173  [slave.cpp:5367] Launching executor 
> service.a3b609b8-27ec-11e6-8044-02c89eb9127e of framework 
> f65b163c-0faf-441f-ac14-91739fa4394c- with resources cpus(*):0.1; 
> mem(*):32 in work directory 
> '/tmp/mesos/slaves/ef232fd9-5114-4d8f-adc3-1669c1e6fdc5-S13/frameworks/f65b163c-0faf-441f-ac14-91739fa4394c-/executors/service.a3b609b8-27ec-11e6-8044-02c89eb9127e/runs/ead42e63-ac92-4ad0-a99c-4af9c3fa5e31'
> 13:33:19.505537  [containerizer.cpp:666] Starting container 
> 'ead42e63-ac92-4ad0-a99c-4af9c3fa5e31' for executor 
> 'service.a3b609b8-27ec-11e6-8044-02c89eb9127e' of framework 
> 'f65b163c-0faf-441f-ac14-91739fa4394c-'
> 13:33:19.505734  [slave.cpp:1698] Queuing task 
> 'service.a3b609b8-27ec-11e6-8044-02c89eb9127e' for executor 
> 'service.a3b609b8-27ec-11e6-8044-02c89eb9127e' of framework 
> f65b163c-0faf-441f-ac14-91739fa4394c-
> ...
> 13:33:19.977483  [containerizer.cpp:1118] Checkpointing executor's forked pid 
> 25576 to 
> '/tmp/mesos/meta/slaves/ef232fd9-5114-4d8f-adc3-1669c1e6fdc5-S13/frameworks/f65b163c-0faf-441f-ac14-91739fa4394c-/executors/service.a3b609b8-27ec-11e6-8044-02c89eb9127e/runs/ead42e63-ac92-4ad0-a99c-4af9c3fa5e31/pids/forked.pid'
> 13:33:35.775195  [slave.cpp:1891] Asked to kill task 
> service.a3b609b8-27ec-11e6-8044-02c89eb9127e of framework 
> f65b163c-0faf-441f-ac14-91739fa4394c-
> 13:33:35.775645  [slave.cpp:3002] Handling status update TASK_KILLED (UUID: 
> eba64915-7df2-483d-8982-a9a46a48a81b) for task 
> service.a3b609b8-27ec-11e6-8044-02c89eb9127e of framework 
> f65b163c-0faf-441f-ac14-91739fa4394c- f
> rom @0.0.0.0:0

[jira] [Updated] (MESOS-5987) Update health check protobuf for HTTP and TCP health check

2016-08-08 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5987:
---
Shepherd: Alexander Rukletsov
  Sprint: Mesosphere Sprint 40
Story Points: 3

> Update health check protobuf for HTTP and TCP health check
> --
>
> Key: MESOS-5987
> URL: https://issues.apache.org/jira/browse/MESOS-5987
> Project: Mesos
>  Issue Type: Task
>Reporter: haosdent
>Assignee: haosdent
>  Labels: health-check, mesosphere
> Fix For: 1.1.0
>
>
> To support HTTP and TCP health check, we need to update the existing 
> {{HealthCheck}} protobuf message according to [~alexr] and [~gaston] 
> commented in https://reviews.apache.org/r/36816/ and 
> https://reviews.apache.org/r/49360/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3325) Running mesos-slave@0.23 in a container causes slave to be lost after a restart

2016-08-08 Thread Lei Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15411550#comment-15411550
 ] 

Lei Xu commented on MESOS-3325:
---

Hi, we hit this issue months ago. The Mesos agent always reads the boot_id from 
the host OS, regenerates the slave ID, and registers with the master. I remember 
there is an issue to track this, but I forget the issue ID. You can give a boot 
ID to the agent to make sure the slave ID does not change on restart.

> Running mesos-slave@0.23 in a container causes slave to be lost after a 
> restart
> ---
>
> Key: MESOS-3325
> URL: https://issues.apache.org/jira/browse/MESOS-3325
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.23.0
> Environment: CoreOS, Container, Docker
>Reporter: Chris Fortier
>Priority: Critical
>
> We are attempting to run mesos-slave 0.23 in a container. However it appears 
> that the mesos-slave agent registers as a new slave instead of 
> re-registering. This causes the formerly-launched tasks to continue running.
> systemd unit being used:
> ```
> [Unit]
> Description=MesosSlave
> After=docker.service dockercfg.service
> Requires=docker.service dockercfg.service
> [Service]
> Environment=MESOS_IMAGE=mesosphere/mesos-slave:0.23.0-1.0.ubuntu1404
> Environment=ZOOKEEPER=redacted
> User=core
> KillMode=process
> Restart=always
> RestartSec=20
> TimeoutStartSec=0
> ExecStartPre=-/usr/bin/docker kill mesos_slave
> ExecStartPre=-/usr/bin/docker rm mesos_slave
> ExecStartPre=/usr/bin/docker pull ${MESOS_IMAGE}
> ExecStart=/usr/bin/sh -c "sudo /usr/bin/docker run \
> --name=mesos_slave \
> --net=host \
> --pid=host \
> --privileged \
> -v /home/core/.dockercfg:/root/.dockercfg:ro \
> -v /sys:/sys \
> -v /usr/bin/docker:/usr/bin/docker:ro \
> -v /var/run/docker.sock:/var/run/docker.sock \
> -v /lib64/libdevmapper.so.1.02:/lib/libdevmapper.so.1.02:ro \
> -v /var/lib/mesos/slave:/var/lib/mesos/slave \
> ${MESOS_IMAGE} \
> --ip=`curl -s http://169.254.169.254/latest/meta-data/local-ipv4` \
> --attributes=zone:$(curl -s 
> http://169.254.169.254/latest/meta-data/placement/availability-zone)\;os:coreos
>  \
> --containerizers=docker,mesos \
> --executor_registration_timeout=10mins \
> --hostname=`curl -s 
> http://169.254.169.254/latest/meta-data/public-hostname` \
> --log_dir=/var/log/mesos \
> --master=zk://${ZOOKEEPER}/mesos \
> --work_dir=/var/lib/mesos/slave"
> ExecStop=/usr/bin/docker stop mesos_slave
> [Install]
> WantedBy=multi-user.target
> [X-Fleet]
> Global=true
> MachineMetadata=role=worker
> ```
> ps, yes I saw the coreos-setup repo was deprecated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4440) Clean get/post/deleteRequest func and let the caller to use the general funcion.

2016-08-08 Thread Yongqiao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15411362#comment-15411362
 ] 

Yongqiao Wang commented on MESOS-4440:
--

[~adam-mesos] I plan to clean up the code described in this ticket; do you have 
time to give me a review? I will submit patches later.

> Clean get/post/deleteRequest func and let the caller to use the general 
> funcion.
> 
>
> Key: MESOS-4440
> URL: https://issues.apache.org/jira/browse/MESOS-4440
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Yongqiao Wang
>Assignee: Yongqiao Wang
>Priority: Minor
>  Labels: tech-debt
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)