[jira] [Updated] (MESOS-5889) Flakiness in SlaveRecoveryTest
[ https://issues.apache.org/jira/browse/MESOS-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-5889:
-----------------------------------
    Sprint: Mesosphere Sprint 40

> Flakiness in SlaveRecoveryTest
> ------------------------------
>
>                 Key: MESOS-5889
>                 URL: https://issues.apache.org/jira/browse/MESOS-5889
>             Project: Mesos
>          Issue Type: Bug
>          Components: tests
>            Reporter: Neil Conway
>            Assignee: Benjamin Mahler
>              Labels: mesosphere
>         Attachments: slave_recovery_cleanup_http_executor.log,
>                      slave_recovery_recover_terminated_executor.log,
>                      slave_recovery_recover_unregistered_http_executor.log
>
> Observed on internal CI. Seems like it is related to cgroups? Observed
> similar failures in the following tests, and probably more related tests:
> SlaveRecoveryTest/0.CleanupHTTPExecutor
> SlaveRecoveryTest/0.RecoverUnregisteredHTTPExecutor
> SlaveRecoveryTest/0.RecoverTerminatedExecutor
> Log files attached.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Assigned] (MESOS-5889) Flakiness in SlaveRecoveryTest
[ https://issues.apache.org/jira/browse/MESOS-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler reassigned MESOS-5889:
--------------------------------------
    Assignee: Benjamin Mahler

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-4577) libprocess can not run on 16-byte aligned stack mandatory architecture(aarch64)
[ https://issues.apache.org/jira/browse/MESOS-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412792#comment-15412792 ]

AndyPang commented on MESOS-4577:
---------------------------------
Yeah, I use an "__aarch64__" macro to distinguish between the AArch64 and x86
architectures, but I don't know why the patch was discarded by Mesosphere.

> libprocess can not run on 16-byte aligned stack mandatory
> architecture(aarch64)
> --------------------------------------------------------------------------------
>
>                 Key: MESOS-4577
>                 URL: https://issues.apache.org/jira/browse/MESOS-4577
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess, stout
>         Environment: Linux 10-175-112-202 4.1.6-rc3.aarch64 #1 SMP Mon Oct 12
> 01:43:03 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux
>            Reporter: AndyPang
>            Assignee: AndyPang
>              Labels: mesosphere
>
> Running Mesos on AArch64 fails with the following error in the log:
> {code}
> E0101 00:06:56.636520 32411 slave.cpp:3342] Container
> 'b6be429a-08f0-4d52-b01d-abfcb6e0106b' for executor
> 'hello.84d205ae-f626-11de-bd66-7a3f6cf980b9' of framework
> '868b9f04-9179-427b-b050-ee8f89ffa3bd-' failed to start: Failed to fork
> executor: Failed to clone child process: Failed to clone: Invalid argument
> {code}
> The "clone" implementation in the libprocess 3rdparty stout library (in
> linux.hpp) wraps the "clone" syscall:
> {code:title=clone|borderStyle=solid}
> inline pid_t clone(const lambda::function<int()>& func, int flags)
> {
>   // Stack for the child.
>   // - unsigned long long used for best alignment.
>   // - 8 MiB appears to be the default for "ulimit -s" on OSX and Linux.
>   //
>   // NOTE: We need to allocate the stack dynamically. This is because
>   // glibc's 'clone' will modify the stack passed to it, therefore the
>   // stack must NOT be shared as multiple 'clone's can be invoked
>   // simultaneously.
>   int stackSize = 8 * 1024 * 1024;
>
>   unsigned long long *stack =
>     new unsigned long long[stackSize / sizeof(unsigned long long)];
>
>   pid_t pid = ::clone(
>       childMain,
>       &stack[stackSize / sizeof(stack[0]) - 1],  // stack grows down.
>       flags,
>       (void*) &func);
>
>   // If CLONE_VM is not set, ::clone would create a process which runs in a
>   // separate copy of the memory space of the calling process. So we destroy
>   // the stack here to avoid a memory leak. If CLONE_VM is set, ::clone would
>   // create a thread which runs in the same memory space as the calling
>   // process.
>   if (!(flags & CLONE_VM)) {
>     delete[] stack;
>   }
>
>   return pid;
> }
> {code}
> The stack passed to the "clone" syscall is only 8-byte aligned, so on an
> architecture that mandates a 16-byte aligned stack (AArch64) the call fails
> with "Invalid argument".

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
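For illustration, a minimal sketch of the kind of alignment fix discussed in
this thread, assuming a C++11 toolchain: the child stack is over-aligned to
16 bytes so AArch64's stricter stack-alignment rule is satisfied. The names
clone16 and Chunk are hypothetical, and this is not the patch at r/43182 as
committed:

{code}
#include <sched.h>      // ::clone (glibc, requires _GNU_SOURCE; g++ default)
#include <sys/types.h>

#include <functional>

// Trampoline passed to ::clone; unpacks and invokes the std::function.
static int childMain(void* arg)
{
  const std::function<int()>* func =
    static_cast<const std::function<int()>*>(arg);
  return (*func)();
}

// Hypothetical variant of stout's clone() wrapper: the stack is allocated
// from 16-byte-aligned chunks, so both the base and the top of the stack
// land on 16-byte boundaries, which AArch64 requires (x86-64 tolerates 8).
inline pid_t clone16(const std::function<int()>& func, int flags)
{
  const size_t stackSize = 8 * 1024 * 1024;  // Matches "ulimit -s" default.

  // alignas(16) on the element type guarantees the allocation starts on a
  // 16-byte boundary; the 16-byte element size keeps the top aligned too.
  struct alignas(16) Chunk { unsigned char bytes[16]; };
  Chunk* stack = new Chunk[stackSize / sizeof(Chunk)];

  pid_t pid = ::clone(
      childMain,
      stack + stackSize / sizeof(Chunk),  // Stack grows down from the top.
      flags,
      const_cast<void*>(static_cast<const void*>(&func)));

  // Without CLONE_VM the child runs in its own copy of the address space,
  // so the parent can free the stack immediately, as the original does.
  if (!(flags & CLONE_VM)) {
    delete[] stack;
  }

  return pid;
}
{code}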
[jira] [Commented] (MESOS-4577) libprocess can not run on 16-byte aligned stack mandatory architecture(aarch64)
[ https://issues.apache.org/jira/browse/MESOS-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412787#comment-15412787 ]

AndyPang commented on MESOS-4577:
---------------------------------
It would be really cool if it has been fixed by kernel 4.7. I temporarily
modified it to use 16-byte alignment; the patch is at:
https://reviews.apache.org/r/43182/diff/1#index_header

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MESOS-5830) Make a sweep to trim excess space around angle brackets
[ https://issues.apache.org/jira/browse/MESOS-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Park updated MESOS-5830:
--------------------------------
    Shepherd: Michael Park

> Make a sweep to trim excess space around angle brackets
> --------------------------------------------------------
>
>                 Key: MESOS-5830
>                 URL: https://issues.apache.org/jira/browse/MESOS-5830
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Benjamin Bannier
>            Assignee: Gaojin CAO
>            Priority: Trivial
>              Labels: mesosphere, newbie
>
> The codebase still has pre-C++11 code where we needed to say e.g.,
> {{vector<vector<string> >}}, since {{>>}} was parsed as the right-shift
> operator; such spaces can now be removed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
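For context, a minimal example of the pre-C++11 workaround this ticket
targets (the variable names are illustrative):

{code}
#include <string>
#include <vector>

// Pre-C++11: the space between the closing brackets was mandatory,
// because ">>" was lexed as the right-shift operator.
std::vector<std::vector<std::string> > oldStyle;

// C++11 and later parse ">>" correctly inside template argument lists,
// so the sweep can remove the extra space.
std::vector<std::vector<std::string>> newStyle;
{code}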
[jira] [Commented] (MESOS-6009) Design doc for task groups
[ https://issues.apache.org/jira/browse/MESOS-6009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412736#comment-15412736 ]

Vinod Kone commented on MESOS-6009:
-----------------------------------
https://docs.google.com/document/d/1FtcyQkDfGp-bPHTW4pUoqQCgVlPde936bo-IIENO_ho/edit#heading=h.ip4t59nlogfz

> Design doc for task groups
> ---------------------------
>
>                 Key: MESOS-6009
>                 URL: https://issues.apache.org/jira/browse/MESOS-6009
>             Project: Mesos
>          Issue Type: Task
>            Reporter: Vinod Kone
>            Assignee: Jie Yu
>
> This ticket tracks the design for implementing task groups, which can be
> used to deliver pod-like semantics.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (MESOS-6009) Design doc for task groups
Vinod Kone created MESOS-6009:
---------------------------------

             Summary: Design doc for task groups
                 Key: MESOS-6009
                 URL: https://issues.apache.org/jira/browse/MESOS-6009
             Project: Mesos
          Issue Type: Task
            Reporter: Vinod Kone
            Assignee: Jie Yu

This ticket tracks the design for implementing task groups, which can be used
to deliver pod-like semantics.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-5830) Make a sweep to trim excess space around angle brackets
[ https://issues.apache.org/jira/browse/MESOS-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412734#comment-15412734 ]

Gaojin CAO commented on MESOS-5830:
-----------------------------------
ok, thanks.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-5889) Flakiness in SlaveRecoveryTest
[ https://issues.apache.org/jira/browse/MESOS-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412667#comment-15412667 ]

Benjamin Mahler commented on MESOS-5889:
----------------------------------------
[~neilc] we try to include good vs. bad logs in flaky test reports, as it
lowers the barrier to looking into the issue (no need to go digging around CI
or compile / run it locally to get logs for a good run). For example:
MESOS-4800.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (MESOS-6008) Add the infrastructure for a new python-based CLI.
Kevin Klues created MESOS-6008:
----------------------------------

             Summary: Add the infrastructure for a new python-based CLI.
                 Key: MESOS-6008
                 URL: https://issues.apache.org/jira/browse/MESOS-6008
             Project: Mesos
          Issue Type: Improvement
            Reporter: Kevin Klues
            Assignee: Kevin Klues

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-5830) Make a sweep to trim excess space around angle brackets
[ https://issues.apache.org/jira/browse/MESOS-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412640#comment-15412640 ]

Gaojin CAO commented on MESOS-5830:
-----------------------------------
https://reviews.apache.org/r/50887/
https://reviews.apache.org/r/50899/
https://reviews.apache.org/r/50900/

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Assigned] (MESOS-5830) Make a sweep to trim excess space around angle brackets
[ https://issues.apache.org/jira/browse/MESOS-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gaojin CAO reassigned MESOS-5830:
---------------------------------
    Assignee: Gaojin CAO

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MESOS-5988) PollSocketImpl can write to a stale fd.
[ https://issues.apache.org/jira/browse/MESOS-5988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Mann updated MESOS-5988:
-----------------------------
    Shepherd: Benjamin Mahler

> PollSocketImpl can write to a stale fd.
> ----------------------------------------
>
>                 Key: MESOS-5988
>                 URL: https://issues.apache.org/jira/browse/MESOS-5988
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>            Reporter: Benjamin Mahler
>            Assignee: Greg Mann
>            Priority: Blocker
>              Labels: mesosphere
>             Fix For: 1.0.1
>
> While tracking down MESOS-5986 with [~greggomann] and [~anandmazumdar], we
> were curious why PollSocketImpl avoids the same issue. It seems that
> PollSocketImpl has a similar race; however, in the case of PollSocketImpl we
> will simply write to a stale file descriptor.
> One example is {{PollSocketImpl::send(const char*, size_t)}}:
> https://github.com/apache/mesos/blob/1.0.0/3rdparty/libprocess/src/poll_socket.cpp#L241-L245
> {code}
> Future<size_t> PollSocketImpl::send(const char* data, size_t size)
> {
>   return io::poll(get(), io::WRITE)
>     .then(lambda::bind(&internal::socket_send_data, get(), data, size));
> }
>
> Future<size_t> socket_send_data(int s, const char* data, size_t size)
> {
>   CHECK(size > 0);
>
>   while (true) {
>     ssize_t length = send(s, data, size, MSG_NOSIGNAL);
>
> #ifdef __WINDOWS__
>     int error = WSAGetLastError();
> #else
>     int error = errno;
> #endif // __WINDOWS__
>
>     if (length < 0 && net::is_restartable_error(error)) {
>       // Interrupted, try again now.
>       continue;
>     } else if (length < 0 && net::is_retryable_error(error)) {
>       // Might block, try again later.
>       return io::poll(s, io::WRITE)
>         .then(lambda::bind(&internal::socket_send_data, s, data, size));
>     } else if (length <= 0) {
>       // Socket error or closed.
>       if (length < 0) {
>         const string error = os::strerror(errno);
>         VLOG(1) << "Socket error while sending: " << error;
>       } else {
>         VLOG(1) << "Socket closed while sending";
>       }
>       if (length == 0) {
>         return length;
>       } else {
>         return Failure(ErrnoError("Socket send failed"));
>       }
>     } else {
>       CHECK(length > 0);
>       return length;
>     }
>   }
> }
> {code}
> If the last reference to the {{Socket}} goes away before the
> {{socket_send_data}} loop completes, then we will write to a stale fd!
> It turns out that we have avoided this issue because in libprocess we happen
> to keep a reference to the {{Socket}} around when sending:
> https://github.com/apache/mesos/blob/1.0.0/3rdparty/libprocess/src/process.cpp#L1678-L1707
> {code}
> void send(Encoder* encoder, Socket socket)
> {
>   switch (encoder->kind()) {
>     case Encoder::DATA: {
>       size_t size;
>       const char* data = static_cast<DataEncoder*>(encoder)->next(&size);
>       socket.send(data, size)
>         .onAny(lambda::bind(
>             &internal::_send,
>             lambda::_1,
>             socket,
>             encoder,
>             size));
>       break;
>     }
>     case Encoder::FILE: {
>       off_t offset;
>       size_t size;
>       int fd = static_cast<FileEncoder*>(encoder)->next(&offset, &size);
>       socket.sendfile(fd, offset, size)
>         .onAny(lambda::bind(
>             &internal::_send,
>             lambda::_1,
>             socket,
>             encoder,
>             size));
>       break;
>     }
>   }
> }
> {code}
> However, this may not be true in all call-sites going forward. Currently, it
> appears that http::Connection can trigger this bug.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
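To make the lifetime race above concrete, a hedged sketch (not libprocess's
actual code) of the pattern the description alludes to: capturing the
{{Socket}} by value in the continuation keeps the underlying fd referenced
until the send completes. The name sendAll is hypothetical:

{code}
// Sketch only: because `socket` is captured by value, the impl's
// reference count cannot drop to zero while socket_send_data() is
// still retrying, so the fd obtained via socket.get() cannot go
// stale mid-send.
Future<size_t> sendAll(Socket socket, const char* data, size_t size)
{
  return io::poll(socket.get(), io::WRITE)
    .then([socket, data, size]() {
      // `socket` is alive here, so its fd is still valid.
      return internal::socket_send_data(socket.get(), data, size);
    });
}
{code}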
[jira] [Created] (MESOS-6007) Operator API v1 Improvements
Vinod Kone created MESOS-6007:
---------------------------------

             Summary: Operator API v1 Improvements
                 Key: MESOS-6007
                 URL: https://issues.apache.org/jira/browse/MESOS-6007
             Project: Mesos
          Issue Type: Epic
            Reporter: Vinod Kone

This is a follow-up epic to track the improvement work from MESOS-4791.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (MESOS-6006) Abstract mesos-style.py to allow future linters to be added more easily
Kevin Klues created MESOS-6006:
----------------------------------

             Summary: Abstract mesos-style.py to allow future linters to be added more easily
                 Key: MESOS-6006
                 URL: https://issues.apache.org/jira/browse/MESOS-6006
             Project: Mesos
          Issue Type: Improvement
            Reporter: Kevin Klues
            Assignee: Kevin Klues
             Fix For: 1.1.0

Currently, mesos-style.py is just a collection of functions that check the
style of relevant files in the Mesos code base. However, the script assumes
that we always want to run cpplint over every file we are checking. Since we
are planning on adding a python linter to the codebase soon, it makes sense
to abstract the common functionality from this script into a class, so that a
cpp-based linter and a python-based linter can inherit the same set of common
functionality.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MESOS-5988) PollSocketImpl can write to a stale fd.
[ https://issues.apache.org/jira/browse/MESOS-5988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Mann updated MESOS-5988:
-----------------------------
          Sprint: Mesosphere Sprint 40
    Story Points: 3

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Assigned] (MESOS-5988) PollSocketImpl can write to a stale fd.
[ https://issues.apache.org/jira/browse/MESOS-5988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Mann reassigned MESOS-5988:
--------------------------------
    Assignee: Greg Mann

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-5830) Make a sweep to trim excess space around angle brackets
[ https://issues.apache.org/jira/browse/MESOS-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412268#comment-15412268 ]

Gilbert Song commented on MESOS-5830:
-------------------------------------
[~zerobleed], welcome to the community! Please quickly go through the doc
[~haosd...@gmail.com] pasted above. You may need a shepherd for this JIRA.
Feel free to join the community slack channel (mesos.slack.com). You can get
quick answers if you ask questions there. :)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-5991) Support running docker daemon inside a container using unified containerizer.
[ https://issues.apache.org/jira/browse/MESOS-5991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412260#comment-15412260 ]

Stéphane Cottin commented on MESOS-5991:
----------------------------------------
Standalone. I plan to migrate to the Mesos plugin once it is compatible with
Mesos >= 1.0.0 and the unified containerizer.

> Support running docker daemon inside a container using unified containerizer.
> ------------------------------------------------------------------------------
>
>                 Key: MESOS-5991
>                 URL: https://issues.apache.org/jira/browse/MESOS-5991
>             Project: Mesos
>          Issue Type: Epic
>            Reporter: Jie Yu
>
> The goal is to develop the necessary pieces in the unified containerizer so
> that a framework can launch a full-fledged docker daemon in a container.
> This will be useful for frameworks like Jenkins. The Jenkins job can still
> use the docker CLI to do builds (e.g., `docker build`, `docker push`), but
> we don't have to install a docker daemon on the host anymore.
> It looks like LXD already supports this and is pretty stable for some users.
> We should do some investigation to see which features are missing in the
> unified containerizer to be able to match what LXD has. Will track all the
> dependencies in this ticket.
> https://www.stgraber.org/2016/04/13/lxd-2-0-docker-in-lxd-712/
> Cgroups and user namespaces support are definitely missing pieces.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-5991) Support running docker daemon inside a container using unified containerizer.
[ https://issues.apache.org/jira/browse/MESOS-5991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412245#comment-15412245 ]

Sunil Shah commented on MESOS-5991:
-----------------------------------
[~kaalh]: are you using this Docker image with the Jenkins Mesos plugin or
running it standalone?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MESOS-5986) SSL Socket CHECK can fail after socket receives EOF
[ https://issues.apache.org/jira/browse/MESOS-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-5986:
-----------------------------------
    Affects Version/s: 1.0.0

> SSL Socket CHECK can fail after socket receives EOF
> ----------------------------------------------------
>
>                 Key: MESOS-5986
>                 URL: https://issues.apache.org/jira/browse/MESOS-5986
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>    Affects Versions: 1.0.0
>            Reporter: Greg Mann
>            Assignee: Greg Mann
>            Priority: Blocker
>              Labels: mesosphere
>             Fix For: 1.0.1
>
> While writing a test for MESOS-3753, I encountered a bug where [this
> check|https://github.com/apache/mesos/blob/853821cafcca3550b9c7bdaba5262d73869e2ee1/3rdparty/libprocess/src/libevent_ssl_socket.cpp#L708]
> fails at the very end of the test body, while objects in the stack frame are
> being destroyed. After adding some debug logging output, I produced the
> following:
> {code}
> I0804 08:32:33.263211 273793024 libevent_ssl_socket.cpp:681] *** in send()17
> I0804 08:32:33.263209 273256448 process.cpp:2970] Cleaning up __limiter__(3)@127.0.0.1:55688
> I0804 08:32:33.263263 275939328 libevent_ssl_socket.cpp:152] *** in initialize(): 14
> I0804 08:32:33.263206 272719872 process.cpp:2865] Resuming (61)@127.0.0.1:55688 at 2016-08-04 15:32:33.263261952+00:00
> I0804 08:32:33.263327 275939328 libevent_ssl_socket.cpp:584] *** in recv()14
> I0804 08:32:33.263337 272719872 hierarchical.cpp:571] Agent e2a49340-34ec-403f-a5a4-15e29c4a2434-S0 deactivated
> I0804 08:32:33.263322 275402752 process.cpp:2865] Resuming help@127.0.0.1:55688 at 2016-08-04 15:32:33.263343104+00:00
> I0804 08:32:33.263510 275939328 libevent_ssl_socket.cpp:322] *** in event_callback(bev)
> I0804 08:32:33.263536 275939328 libevent_ssl_socket.cpp:353] *** in event_callback check for EOF/CONNECTED/ERROR: 19
> I0804 08:32:33.263592 275939328 libevent_ssl_socket.cpp:159] *** in shutdown(): 19
> I0804 08:32:33.263622 1985901312 process.cpp:3170] Donating thread to (87)@127.0.0.1:55688 while waiting
> I0804 08:32:33.263639 274329600 process.cpp:2865] Resuming __http__(12)@127.0.0.1:55688 at 2016-08-04 15:32:33.263653888+00:00
> I0804 08:32:33.263659 1985901312 process.cpp:2865] Resuming (87)@127.0.0.1:55688 at 2016-08-04 15:32:33.263671040+00:00
> I0804 08:32:33.263730 1985901312 process.cpp:2970] Cleaning up (87)@127.0.0.1:55688
> I0804 08:32:33.263741 275939328 libevent_ssl_socket.cpp:322] *** in event_callback(bev)
> I0804 08:32:33.263736 274329600 process.cpp:2970] Cleaning up __http__(12)@127.0.0.1:55688
> I0804 08:32:33.263778 275939328 libevent_ssl_socket.cpp:353] *** in event_callback check for EOF/CONNECTED/ERROR: 17
> I0804 08:32:33.263818 275939328 libevent_ssl_socket.cpp:159] *** in shutdown(): 17
> I0804 08:32:33.263839 272183296 process.cpp:2865] Resuming help@127.0.0.1:55688 at 2016-08-04 15:32:33.263857920+00:00
> I0804 08:32:33.263933 273793024 process.cpp:2865] Resuming __gc__@127.0.0.1:55688 at 2016-08-04 15:32:33.263951104+00:00
> I0804 08:32:33.264034 275939328 libevent_ssl_socket.cpp:681] *** in send()17
> I0804 08:32:33.264020 272719872 process.cpp:2865] Resuming __http__(11)@127.0.0.1:55688 at 2016-08-04 15:32:33.264041984+00:00
> I0804 08:32:33.264036 274329600 process.cpp:2865] Resuming status-update-manager(3)@127.0.0.1:55688 at 2016-08-04 15:32:33.264056064+00:00
> I0804 08:32:33.264071 272719872 process.cpp:2970] Cleaning up __http__(11)@127.0.0.1:55688
> I0804 08:32:33.264088 274329600 process.cpp:2970] Cleaning up status-update-manager(3)@127.0.0.1:55688
> I0804 08:32:33.264086 275939328 libevent_ssl_socket.cpp:721] *** sending on socket: 17, data: 0
> I0804 08:32:33.264112 272183296 process.cpp:2865] Resuming (89)@127.0.0.1:55688 at 2016-08-04 15:32:33.264126976+00:00
> I0804 08:32:33.264118 275402752 process.cpp:2865] Resuming help@127.0.0.1:55688 at 2016-08-04 15:32:33.264144896+00:00
> I0804 08:32:33.264149 272183296 process.cpp:2970] Cleaning up (89)@127.0.0.1:55688
> I0804 08:32:33.264202 275939328 libevent_ssl_socket.cpp:281] *** in send_callback(bev)
> I0804 08:32:33.264400 273793024 process.cpp:3170] Donating thread to (86)@127.0.0.1:55688 while waiting
> I0804 08:32:33.264413 273256448 process.cpp:2865] Resuming (76)@127.0.0.1:55688 at 2016-08-04 15:32:33.264428032+00:00
> I0804 08:32:33.296268 275939328 libevent_ssl_socket.cpp:300] *** in send_callback(): 17
> I0804 08:32:33.296419 273256448 process.cpp:2970] Cleaning up (76)@127.0.0.1:55688
> I0804 08:32:33.296357 273793024 process.cpp:2865] Resuming (86)@127.0.0.1:55688 at 2016-08-04 15:32:33.296414976+00:00
> I0804 08:32:33.296464 273793024 process.cpp:2970] Cleaning up (86)@127.0.0.1:55688
> I0804 08:32:33.296497 275939328
[jira] [Updated] (MESOS-5986) SSL Socket CHECK can fail after socket receives EOF
[ https://issues.apache.org/jira/browse/MESOS-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-5986:
-----------------------------------
    Fix Version/s: 1.1.0

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MESOS-5986) SSL Socket CHECK can fail after socket receives EOF
[ https://issues.apache.org/jira/browse/MESOS-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-5986:
-----------------------------------
    Fix Version/s: (was: 1.1.0)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MESOS-5986) SSL Socket CHECK can fail after socket receives EOF
[ https://issues.apache.org/jira/browse/MESOS-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-5986:
-----------------------------------
    Fix Version/s: 1.0.1

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend
[ https://issues.apache.org/jira/browse/MESOS-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412219#comment-15412219 ]

Gilbert Song commented on MESOS-6004:
-------------------------------------
OK, that answers #3. Just saw it in the Slack channel. Your image contains
about 55 layers.

> Tasks fail when provisioning multiple containers with large docker images
> using copy backend
> ---------------------------------------------------------------------------------------------
>
>                 Key: MESOS-6004
>                 URL: https://issues.apache.org/jira/browse/MESOS-6004
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, docker
>    Affects Versions: 0.28.2, 1.0.0
>         Environment: h4. Agent Platform
> - Ubuntu 16.04
> - AWS g2.x2large instance
> - Nvidia support enabled
> h4. Agent Configuration
> {noformat}
> --containerizers=mesos,docker
> --docker_config=
> --docker_store_dir=/mnt/mesos/store/docker
> --executor_registration_timeout=3mins
> --hostname=
> --image_providers=docker
> --image_provisioner_backend=copy
> --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia
> --switch_user=false
> --work_dir=/mnt/mesos
> {noformat}
> h4. Framework
> - custom framework written in python
> - using unified containerizer with docker images
> h4. Test Setup
> * 1 master
> * 1 agent
> * 5 tasks scheduled at the same time:
> ** resources: cpus: 0.1, mem: 128
> ** command: `echo test`
> ** docker image: custom docker image, based on nvidia/cuda ~5gb
> ** the same docker image was used for all tasks, already pulled.
>            Reporter: Michael Thomas
>              Labels: containerizer, docker, performance
>
> When scheduling more than one task on the same agent, all tasks fail, as
> containers seem to be destroyed during provisioning.
> Specifically, the errors in the agent logs are:
> {noformat}
> E0808 15:53:09.691315 30996 slave.cpp:3976] Container
> 'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework
> c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being
> destroyed during provisioning
> {noformat}
> and
> {noformat}
> I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of
> framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not
> register within 3mins
> {noformat}
> As the default provisioning method {{copy}} is being used, I assume this is
> due to the provisioning of multiple containers taking too long, such that
> the agent will not wait. For large images, this method is simply not
> performant.
> The issue did not occur when only one task was scheduled.
> Increasing the {{executor_registration_timeout}} parameter seemed to help a
> bit, as it allowed scheduling at least 2 tasks at the same time, but it
> still fails with more (5 in this case).
> h4. Complete logs
> (with GLOG_v=0, as with 1 it was too long)
> {noformat}
> Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661067 30961 main.cpp:434] Starting Mesos agent
> Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661551 30961 slave.cpp:198] Agent started on 1)@172.31.23.17:5051
> Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661578 30961 slave.cpp:199] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos,docker" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_config="{"auths":{"https:\/\/index.docker.io\/v1\/":{"auth":"dGVycmFsb3VwZTpUYWxFWUFOSXR5","email":"sebastian.ge...@terraloupe.com"}}}" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/mnt/mesos/store/docker" --do
> Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: cker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="3mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname="ec2-52-59-113-0.eu-central-1.compute.amazonaws.com" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --image_providers="docker" --image_provisioner_backend="copy"
[jira] [Commented] (MESOS-5986) SSL Socket CHECK can fail after socket receives EOF
[ https://issues.apache.org/jira/browse/MESOS-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412203#comment-15412203 ]

Greg Mann commented on MESOS-5986:
----------------------------------
{code}
commit f5822f3c13f4fdacbb390341940d3379248a9837
Author: Greg Mann g...@mesosphere.io
Date:   Fri Aug 5 18:19:33 2016 -0700

    Removed incorrect CHECK in SSL socket `send()`.

    The lambda placed on the event loop by the libevent SSL socket's
    `send()` method previously used a `CHECK` to ensure that the socket's
    `send_request` member was not `nullptr`. This patch removes this check,
    since `send_request` may become `nullptr` any time the socket receives
    an EOF or ERROR event. Note that the current handling of events is
    incorrect also, but we do not attempt a fix here. To be specific,
    reading EOF should not deal with send requests at all (see MESOS-5999).
    Also, the ERROR events are not differentiated between reading and
    writing. Lastly, when we receive an EOF we do not ensure that the
    caller can read the bytes that remain in the buffer!

    Review: https://reviews.apache.org/r/50741/
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
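A hedged sketch of the shape of the fix the commit message describes, using a
simplified stand-in for the socket impl (SSLSocketImpl, SendRequest, and
onSendReady are illustrative names, not the actual patch): a cleared
{{send_request}} is treated as "shutdown already ran" rather than a fatal
invariant violation.

{code}
#include <memory>
#include <mutex>

// Placeholder for the socket's pending-write bookkeeping.
struct SendRequest {};

struct SSLSocketImpl {
  std::mutex lock;
  // Cleared by the event callback when an EOF or ERROR event fires,
  // possibly while a send continuation is still queued on the event loop.
  std::unique_ptr<SendRequest> send_request;
};

// Post-fix shape of the queued send continuation: a null send_request
// means shutdown() already discarded the request, not a logic error,
// so we return instead of CHECK-failing.
void onSendReady(const std::shared_ptr<SSLSocketImpl>& self)
{
  std::lock_guard<std::mutex> guard(self->lock);

  if (self->send_request == nullptr) {
    return;  // EOF/ERROR raced with the queued send; drop it.
  }

  // ... perform the actual write (bufferevent_write in libevent) ...
}
{code}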
[jira] [Updated] (MESOS-5986) SSL Socket CHECK can fail after socket receives EOF
[ https://issues.apache.org/jira/browse/MESOS-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-5986: - Fix Version/s: (was: 1.0.1) > SSL Socket CHECK can fail after socket receives EOF > --- > > Key: MESOS-5986 > URL: https://issues.apache.org/jira/browse/MESOS-5986 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Greg Mann >Assignee: Greg Mann >Priority: Blocker > Labels: mesosphere > > While writing a test for MESOS-3753, I encountered a bug where [this > check|https://github.com/apache/mesos/blob/853821cafcca3550b9c7bdaba5262d73869e2ee1/3rdparty/libprocess/src/libevent_ssl_socket.cpp#L708] > fails at the very end of the test body, while objects in the stack frame are > being destroyed. After adding some debug logging output, I produced the > following: > {code} > I0804 08:32:33.263211 273793024 libevent_ssl_socket.cpp:681] *** in send()17 > I0804 08:32:33.263209 273256448 process.cpp:2970] Cleaning up > __limiter__(3)@127.0.0.1:55688 > I0804 08:32:33.263263 275939328 libevent_ssl_socket.cpp:152] *** in > initialize(): 14 > I0804 08:32:33.263206 272719872 process.cpp:2865] Resuming > (61)@127.0.0.1:55688 at 2016-08-04 15:32:33.263261952+00:00 > I0804 08:32:33.263327 275939328 libevent_ssl_socket.cpp:584] *** in recv()14 > I0804 08:32:33.263337 272719872 hierarchical.cpp:571] Agent > e2a49340-34ec-403f-a5a4-15e29c4a2434-S0 deactivated > I0804 08:32:33.263322 275402752 process.cpp:2865] Resuming > help@127.0.0.1:55688 at 2016-08-04 15:32:33.263343104+00:00 > I0804 08:32:33.263510 275939328 libevent_ssl_socket.cpp:322] *** in > event_callback(bev) > I0804 08:32:33.263536 275939328 libevent_ssl_socket.cpp:353] *** in > event_callback check for EOF/CONNECTED/ERROR: 19 > I0804 08:32:33.263592 275939328 libevent_ssl_socket.cpp:159] *** in > shutdown(): 19 > I0804 08:32:33.263622 1985901312 process.cpp:3170] Donating thread to > (87)@127.0.0.1:55688 while waiting > I0804 08:32:33.263639 274329600 process.cpp:2865] Resuming > __http__(12)@127.0.0.1:55688 at 2016-08-04 15:32:33.263653888+00:00 > I0804 08:32:33.263659 1985901312 process.cpp:2865] Resuming > (87)@127.0.0.1:55688 at 2016-08-04 15:32:33.263671040+00:00 > I0804 08:32:33.263730 1985901312 process.cpp:2970] Cleaning up > (87)@127.0.0.1:55688 > I0804 08:32:33.263741 275939328 libevent_ssl_socket.cpp:322] *** in > event_callback(bev) > I0804 08:32:33.263736 274329600 process.cpp:2970] Cleaning up > __http__(12)@127.0.0.1:55688 > I0804 08:32:33.263778 275939328 libevent_ssl_socket.cpp:353] *** in > event_callback check for EOF/CONNECTED/ERROR: 17 > I0804 08:32:33.263818 275939328 libevent_ssl_socket.cpp:159] *** in > shutdown(): 17 > I0804 08:32:33.263839 272183296 process.cpp:2865] Resuming > help@127.0.0.1:55688 at 2016-08-04 15:32:33.263857920+00:00 > I0804 08:32:33.263933 273793024 process.cpp:2865] Resuming > __gc__@127.0.0.1:55688 at 2016-08-04 15:32:33.263951104+00:00 > I0804 08:32:33.264034 275939328 libevent_ssl_socket.cpp:681] *** in send()17 > I0804 08:32:33.264020 272719872 process.cpp:2865] Resuming > __http__(11)@127.0.0.1:55688 at 2016-08-04 15:32:33.264041984+00:00 > I0804 08:32:33.264036 274329600 process.cpp:2865] Resuming > status-update-manager(3)@127.0.0.1:55688 at 2016-08-04 > 15:32:33.264056064+00:00 > I0804 08:32:33.264071 272719872 process.cpp:2970] Cleaning up > __http__(11)@127.0.0.1:55688 > I0804 08:32:33.264088 274329600 process.cpp:2970] Cleaning up > status-update-manager(3)@127.0.0.1:55688 > I0804 08:32:33.264086 275939328 
libevent_ssl_socket.cpp:721] *** sending on > socket: 17, data: 0 > I0804 08:32:33.264112 272183296 process.cpp:2865] Resuming > (89)@127.0.0.1:55688 at 2016-08-04 15:32:33.264126976+00:00 > I0804 08:32:33.264118 275402752 process.cpp:2865] Resuming > help@127.0.0.1:55688 at 2016-08-04 15:32:33.264144896+00:00 > I0804 08:32:33.264149 272183296 process.cpp:2970] Cleaning up > (89)@127.0.0.1:55688 > I0804 08:32:33.264202 275939328 libevent_ssl_socket.cpp:281] *** in > send_callback(bev) > I0804 08:32:33.264400 273793024 process.cpp:3170] Donating thread to > (86)@127.0.0.1:55688 while waiting > I0804 08:32:33.264413 273256448 process.cpp:2865] Resuming > (76)@127.0.0.1:55688 at 2016-08-04 15:32:33.264428032+00:00 > I0804 08:32:33.296268 275939328 libevent_ssl_socket.cpp:300] *** in > send_callback(): 17 > I0804 08:32:33.296419 273256448 process.cpp:2970] Cleaning up > (76)@127.0.0.1:55688 > I0804 08:32:33.296357 273793024 process.cpp:2865] Resuming > (86)@127.0.0.1:55688 at 2016-08-04 15:32:33.296414976+00:00 > I0804 08:32:33.296464 273793024 process.cpp:2970] Cleaning up > (86)@127.0.0.1:55688 > I0804 08:32:33.296497 275939328 libevent_ssl_socket.cpp:104] *** releasing > SSL socket > I0804
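The interleaving above shows a send() on socket 17 racing with the EOF-driven shutdown() of the same socket ("in shutdown(): 17" followed by "in send_callback(): 17" after "releasing SSL socket"). As a self-contained illustration only — the class and member names below are invented for the sketch and this is not the actual libevent_ssl_socket.cpp code — the failing pattern and a defensive alternative might look like this:
{code}
#include <cassert>
#include <mutex>

// Illustrative sketch of the suspected race; all names are hypothetical.
class SslSocketSketch
{
public:
  // Runs when the event callback observes EOF/ERROR: tear down the
  // underlying bufferevent state.
  void shutdown()
  {
    std::lock_guard<std::mutex> lock(mutex);
    connected = false;
  }

  // Racy pattern: a send() dispatched before shutdown() but executed
  // after it trips the invariant while the test's stack objects die.
  void sendUnsafe()
  {
    assert(connected);  // Stands in for the failing CHECK.
    // ... write to the bufferevent ...
  }

  // Defensive pattern: surface "already shut down" as a send failure
  // instead of aborting the process.
  bool sendSafe()
  {
    std::lock_guard<std::mutex> lock(mutex);
    if (!connected) {
      return false;  // Socket already saw EOF and was shut down.
    }
    // ... write to the bufferevent ...
    return true;
  }

private:
  std::mutex mutex;
  bool connected = true;
};
{code}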
[jira] [Commented] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend
[ https://issues.apache.org/jira/browse/MESOS-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412187#comment-15412187 ] Gilbert Song commented on MESOS-6004: - 3. Please also attach the approximate number of image layers. Appreciated! > Tasks fail when provisioning multiple containers with large docker images > using copy backend > - > > Key: MESOS-6004 > URL: https://issues.apache.org/jira/browse/MESOS-6004 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 0.28.2, 1.0.0 > Environment: h4. Agent Platform > - Ubuntu 16.04 > - AWS g2.x2large instance > - Nvidia support enabled > h4. Agent Configuration > {noformat} > --containerizers=mesos,docker > --docker_config= > --docker_store_dir=/mnt/mesos/store/docker > --executor_registration_timeout=3mins > --hostname= > --image_providers=docker > --image_provisioner_backend=copy > --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia > --switch_user=false > --work_dir=/mnt/mesos > {noformat} > h4. Framework > - custom framework written in python > - using unified containerizer with docker images > h4. Test Setup > * 1 master > * 1 agent > * 5 tasks scheduled at the same time: > ** resources: cpus: 0.1, mem: 128 > ** command: `echo test` > ** docker image: custom docker image, based on nvidia/cuda ~5gb > ** the same docker image was used for all tasks and was already pulled. >Reporter: Michael Thomas > Labels: containerizer, docker, performance > > When scheduling more than one task on the same agent, all tasks fail, as > containers seem to be destroyed during provisioning. > Specifically, the errors in the agent logs are: > {noformat} > E0808 15:53:09.691315 30996 slave.cpp:3976] Container > 'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework > c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being > destroyed during provisioning > {noformat} > and > {noformat} > I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of > framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not > register within 3mins > {noformat} > As the default provisioning method {{copy}} is being used, I assume this is > due to the provisioning of multiple containers taking too long and the agent > not waiting. For large images, this method is simply not performant. > The issue did not occur when only one task was scheduled. > Increasing the {{executor_registration_timeout}} parameter seemed to help a > bit, as it allowed scheduling at least 2 tasks at the same time, but it still > fails with more (5 in this case) > h4. 
Complete logs > (with GLOG_v=0, as with 1 it was too long) > {noformat} > Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661067 > 30961 main.cpp:434] Starting Mesos agent > Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661551 > 30961 slave.cpp:198] Agent started on 1)@172.31.23.17:5051 > Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661578 > 30961 slave.cpp:199] Flags at startup: > --appc_simple_discovery_uri_prefix="http://; > --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" > --authenticate_http_readwrite="false" --authenticatee="crammd5" > --authentication_backoff_factor="1secs" --authorizer="local" > --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" > --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" > --cgroups_root="mesos" --container_disk_watch_interval="15secs" > --containerizers="mesos,docker" --default_role="*" > --disk_watch_interval="1mins" --docker="docker" > --docker_config="{"auths":{"https:\/\/index.docker.io\/v1\/":{"auth":"dGVycmFsb3VwZTpUYWxFWUFOSXR5","email":"sebastian.ge...@terraloupe.com"}}}" > --docker_kill_orphans="true" > --docker_registry="https://registry-1.docker.io; --docker_remove_delay="6hrs" > --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" > --docker_store_dir="/mnt/mesos/store/docker" --do > Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: > cker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" > --enforce_container_disk_quota="false" > --executor_registration_timeout="3mins" > --executor_shutdown_grace_period="5secs" > --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" > --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" > --hadoop_home="" --help="false" > --hostname="ec2-52-59-113-0.eu-central-1.compute.amazonaws.com" > --hostname_lookup="true" --http_authenticators="basic" > --http_command_executor="false" --image_providers="docker" > --image_provisioner_backend="copy" --initialize_driver_logging="true" >
[jira] [Comment Edited] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend
[ https://issues.apache.org/jira/browse/MESOS-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412159#comment-15412159 ] Gilbert Song edited comment on MESOS-6004 at 8/8/16 5:56 PM: - Thanks [~mito]. We need to fix this issue. Most likely this is because the image size is too large, and it takes time to download/copy. Could you please: 1. Just out of curiosity, could you test using the local puller & the overlay backend (--docker_registry=/path/to/your/image/tarballs/folder and --image_provisioner_backend=overlay)? We want to know whether you still have the scheduling issue. 2. Attach the GLOG_v=1 log; it should be fine in size if you are using `noformat`. was (Author: gilbert): Thanks [~mito]. We need to fix this issue. Most likely this is because the image size is too large, and it takes time to download/copy. Could you please: 1. Just out of curiosity, could you test using the local puller & the overlay backend (`--docker_registry=/path/to/your/image/tarballs/folder` and `--image_provisioner_backend=overlay`)? We want to know whether you still have the scheduling issue. 2. Attach the GLOG_v=1 log; it should be fine in size if you are using `noformat`. > Tasks fail when provisioning multiple containers with large docker images > using copy backend > - > > Key: MESOS-6004 > URL: https://issues.apache.org/jira/browse/MESOS-6004 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 0.28.2, 1.0.0 > Environment: h4. Agent Platform > - Ubuntu 16.04 > - AWS g2.x2large instance > - Nvidia support enabled > h4. Agent Configuration > {noformat} > --containerizers=mesos,docker > --docker_config= > --docker_store_dir=/mnt/mesos/store/docker > --executor_registration_timeout=3mins > --hostname= > --image_providers=docker > --image_provisioner_backend=copy > --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia > --switch_user=false > --work_dir=/mnt/mesos > {noformat} > h4. Framework > - custom framework written in python > - using unified containerizer with docker images > h4. Test Setup > * 1 master > * 1 agent > * 5 tasks scheduled at the same time: > ** resources: cpus: 0.1, mem: 128 > ** command: `echo test` > ** docker image: custom docker image, based on nvidia/cuda ~5gb > ** the same docker image was used for all tasks and was already pulled. >Reporter: Michael Thomas > Labels: containerizer, docker, performance > > When scheduling more than one task on the same agent, all tasks fail, as > containers seem to be destroyed during provisioning. > Specifically, the errors in the agent logs are: > {noformat} > E0808 15:53:09.691315 30996 slave.cpp:3976] Container > 'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework > c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being > destroyed during provisioning > {noformat} > and > {noformat} > I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of > framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not > register within 3mins > {noformat} > As the default provisioning method {{copy}} is being used, I assume this is > due to the provisioning of multiple containers taking too long and the agent > not waiting. For large images, this method is simply not performant. > The issue did not occur when only one task was scheduled. > Increasing the {{executor_registration_timeout}} parameter seemed to help a > bit, as it allowed scheduling at least 2 tasks at the same time, but it still > fails with more (5 in this case) > h4. 
Complete logs > (with GLOG_v=0, as with 1 it was too long) > {noformat} > Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661067 > 30961 main.cpp:434] Starting Mesos agent > Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661551 > 30961 slave.cpp:198] Agent started on 1)@172.31.23.17:5051 > Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661578 > 30961 slave.cpp:199] Flags at startup: > --appc_simple_discovery_uri_prefix="http://; > --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" > --authenticate_http_readwrite="false" --authenticatee="crammd5" > --authentication_backoff_factor="1secs" --authorizer="local" > --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" > --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" > --cgroups_root="mesos" --container_disk_watch_interval="15secs" > --containerizers="mesos,docker" --default_role="*" > --disk_watch_interval="1mins" --docker="docker" > --docker_config="{"auths":{"https:\/\/index.docker.io\/v1\/":{"auth":"dGVycmFsb3VwZTpUYWxFWUFOSXR5","email":"sebastian.ge...@terraloupe.com"}}}" >
[jira] [Updated] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend
[ https://issues.apache.org/jira/browse/MESOS-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilbert Song updated MESOS-6004: Affects Version/s: 0.28.2 > Tasks fail when provisioning multiple containers with large docker images > using copy backend > - > > Key: MESOS-6004 > URL: https://issues.apache.org/jira/browse/MESOS-6004 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 0.28.2, 1.0.0 > Environment: h4. Agent Platform > - Ubuntu 16.04 > - AWS g2.x2large instance > - Nvidia support enabled > h4. Agent Configuration > {noformat} > --containerizers=mesos,docker > --docker_config= > --docker_store_dir=/mnt/mesos/store/docker > --executor_registration_timeout=3mins > --hostname= > --image_providers=docker > --image_provisioner_backend=copy > --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia > --switch_user=false > --work_dir=/mnt/mesos > {noformat} > h4. Framework > - custom framework written in python > - using unified containerizer with docker images > h4. Test Setup > * 1 master > * 1 agent > * 5 tasks scheduled at the same time: > ** resources: cpus: 0.1, mem: 128 > ** command: `echo test` > ** docker image: custom docker image, based on nvidia/cuda ~5gb > ** the same docker image was used for all tasks and was already pulled. >Reporter: Michael Thomas > Labels: containerizer, docker, performance > > When scheduling more than one task on the same agent, all tasks fail, as > containers seem to be destroyed during provisioning. > Specifically, the errors in the agent logs are: > {noformat} > E0808 15:53:09.691315 30996 slave.cpp:3976] Container > 'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework > c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being > destroyed during provisioning > {noformat} > and > {noformat} > I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of > framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not > register within 3mins > {noformat} > As the default provisioning method {{copy}} is being used, I assume this is > due to the provisioning of multiple containers taking too long and the agent > not waiting. For large images, this method is simply not performant. > The issue did not occur when only one task was scheduled. > Increasing the {{executor_registration_timeout}} parameter seemed to help a > bit, as it allowed scheduling at least 2 tasks at the same time, but it still > fails with more (5 in this case) > h4. 
Complete logs > (with GLOG_v=0, as with 1 it was too long) > {noformat} > Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661067 > 30961 main.cpp:434] Starting Mesos agent > Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661551 > 30961 slave.cpp:198] Agent started on 1)@172.31.23.17:5051 > Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661578 > 30961 slave.cpp:199] Flags at startup: > --appc_simple_discovery_uri_prefix="http://; > --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" > --authenticate_http_readwrite="false" --authenticatee="crammd5" > --authentication_backoff_factor="1secs" --authorizer="local" > --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" > --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" > --cgroups_root="mesos" --container_disk_watch_interval="15secs" > --containerizers="mesos,docker" --default_role="*" > --disk_watch_interval="1mins" --docker="docker" > --docker_config="{"auths":{"https:\/\/index.docker.io\/v1\/":{"auth":"dGVycmFsb3VwZTpUYWxFWUFOSXR5","email":"sebastian.ge...@terraloupe.com"}}}" > --docker_kill_orphans="true" > --docker_registry="https://registry-1.docker.io; --docker_remove_delay="6hrs" > --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" > --docker_store_dir="/mnt/mesos/store/docker" --do > Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: > cker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" > --enforce_container_disk_quota="false" > --executor_registration_timeout="3mins" > --executor_shutdown_grace_period="5secs" > --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" > --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" > --hadoop_home="" --help="false" > --hostname="ec2-52-59-113-0.eu-central-1.compute.amazonaws.com" > --hostname_lookup="true" --http_authenticators="basic" > --http_command_executor="false" --image_providers="docker" > --image_provisioner_backend="copy" --initialize_driver_logging="true" > --isolation="filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia" >
[jira] [Commented] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend
[ https://issues.apache.org/jira/browse/MESOS-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412159#comment-15412159 ] Gilbert Song commented on MESOS-6004: - Thanks [~mito]. We need to fix this issue. Most likely this is because the image size is too large, and it takes time to download/copy. Could you please: 1. Just out of curiosity, could you test using the local puller & the overlay backend (`--docker_registry=/path/to/your/image/tarballs/folder` and `--image_provisioner_backend=overlay`)? We want to know whether you still have the scheduling issue. 2. Attach the GLOG_v=1 log; it should be fine in size if you are using `noformat`. > Tasks fail when provisioning multiple containers with large docker images > using copy backend > - > > Key: MESOS-6004 > URL: https://issues.apache.org/jira/browse/MESOS-6004 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 1.0.0 > Environment: h4. Agent Platform > - Ubuntu 16.04 > - AWS g2.x2large instance > - Nvidia support enabled > h4. Agent Configuration > {noformat} > --containerizers=mesos,docker > --docker_config= > --docker_store_dir=/mnt/mesos/store/docker > --executor_registration_timeout=3mins > --hostname= > --image_providers=docker > --image_provisioner_backend=copy > --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia > --switch_user=false > --work_dir=/mnt/mesos > {noformat} > h4. Framework > - custom framework written in python > - using unified containerizer with docker images > h4. Test Setup > * 1 master > * 1 agent > * 5 tasks scheduled at the same time: > ** resources: cpus: 0.1, mem: 128 > ** command: `echo test` > ** docker image: custom docker image, based on nvidia/cuda ~5gb > ** the same docker image was used for all tasks and was already pulled. >Reporter: Michael Thomas > Labels: containerizer, docker, performance > > When scheduling more than one task on the same agent, all tasks fail, as > containers seem to be destroyed during provisioning. > Specifically, the errors in the agent logs are: > {noformat} > E0808 15:53:09.691315 30996 slave.cpp:3976] Container > 'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework > c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being > destroyed during provisioning > {noformat} > and > {noformat} > I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of > framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not > register within 3mins > {noformat} > As the default provisioning method {{copy}} is being used, I assume this is > due to the provisioning of multiple containers taking too long and the agent > not waiting. For large images, this method is simply not performant. > The issue did not occur when only one task was scheduled. > Increasing the {{executor_registration_timeout}} parameter seemed to help a > bit, as it allowed scheduling at least 2 tasks at the same time, but it still > fails with more (5 in this case) > h4. 
Complete logs > (with GLOG_v=0, as with 1 it was too long) > {noformat} > Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661067 > 30961 main.cpp:434] Starting Mesos agent > Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661551 > 30961 slave.cpp:198] Agent started on 1)@172.31.23.17:5051 > Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661578 > 30961 slave.cpp:199] Flags at startup: > --appc_simple_discovery_uri_prefix="http://; > --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" > --authenticate_http_readwrite="false" --authenticatee="crammd5" > --authentication_backoff_factor="1secs" --authorizer="local" > --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" > --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" > --cgroups_root="mesos" --container_disk_watch_interval="15secs" > --containerizers="mesos,docker" --default_role="*" > --disk_watch_interval="1mins" --docker="docker" > --docker_config="{"auths":{"https:\/\/index.docker.io\/v1\/":{"auth":"dGVycmFsb3VwZTpUYWxFWUFOSXR5","email":"sebastian.ge...@terraloupe.com"}}}" > --docker_kill_orphans="true" > --docker_registry="https://registry-1.docker.io; --docker_remove_delay="6hrs" > --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" > --docker_store_dir="/mnt/mesos/store/docker" --do > Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: > cker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" > --enforce_container_disk_quota="false" > --executor_registration_timeout="3mins" > --executor_shutdown_grace_period="5secs" > --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" >
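For readers wanting to try Gilbert's suggestion above, the experiment amounts to pointing the agent at a local directory of {{docker save}} tarballs and switching the provisioner backend. A sketch of the relevant agent flags, where the tarball path is a placeholder:
{noformat}
--docker_registry=/path/to/your/image/tarballs/folder
--image_providers=docker
--image_provisioner_backend=overlay
{noformat}
With {{--docker_registry}} set to a local path the image is fetched by the local puller rather than over the network, and the overlay backend mounts the layers instead of copying them, so both the download and the per-container copy cost drop out of the provisioning time.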
[jira] [Updated] (MESOS-6003) Add logging module for logging to an external program
[ https://issues.apache.org/jira/browse/MESOS-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu updated MESOS-6003: - Shepherd: Joseph Wu > Add logging module for logging to an external program > - > > Key: MESOS-6003 > URL: https://issues.apache.org/jira/browse/MESOS-6003 > Project: Mesos > Issue Type: Improvement > Components: modules >Reporter: Will Rouesnel >Assignee: Will Rouesnel >Priority: Minor > > In the vein of the logrotate module for logging, there should be a similar > module which provides support for logging to an arbitrary log handling > program, with suitable task metadata provided by environment variables or > command line arguments. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6005) Support docker registry running non-https on localhost:
[ https://issues.apache.org/jira/browse/MESOS-6005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412136#comment-15412136 ] Gilbert Song commented on MESOS-6005: - Thanks [~zhitao], we will address it. > Support docker registry running non-https on localhost: > > > Key: MESOS-6005 > URL: https://issues.apache.org/jira/browse/MESOS-6005 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Zhitao Li > > (Please update title with whatever this ended up) > The Docker daemon by default does not use https if the registry host is > localhost/127.0.0.1, which is what many people use in dev testing and the like. > Right now image fetching only supports plain http if the port is 80. Ideally this > would be configurable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4577) libprocess can not run on 16-byte aligned stack mandatory architecture(aarch64)
[ https://issues.apache.org/jira/browse/MESOS-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412091#comment-15412091 ] gtin commented on MESOS-4577: - I got it to work temporarily, until there is a mainline kernel for the odroid c2 that supports this. The temporary change I made was in mesos/3rdparty/stout/include/stout/os/linux.hpp: I changed the stack type from unsigned long long to long double to provide 16-byte alignment. long double *stack = new long double[stackSize/sizeof(long double)]; pid_t pid = ::clone( childMain, &stack[stackSize/sizeof(stack[0]) - 1], // stack grows down. flags, (void*) &func); > libprocess can not run on 16-byte aligned stack mandatory > architecture(aarch64) > > > Key: MESOS-4577 > URL: https://issues.apache.org/jira/browse/MESOS-4577 > Project: Mesos > Issue Type: Bug > Components: libprocess, stout > Environment: Linux 10-175-112-202 4.1.6-rc3.aarch64 #1 SMP Mon Oct 12 > 01:43:03 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux >Reporter: AndyPang >Assignee: AndyPang > Labels: mesosphere > > Running Mesos on AArch64 produces an error; the log is: > {code} > E0101 00:06:56.636520 32411 slave.cpp:3342] Container > 'b6be429a-08f0-4d52-b01d-abfcb6e0106b' for executor > 'hello.84d205ae-f626-11de-bd66-7a3f6cf980b9' of framework > '868b9f04-9179-427b-b050-ee8f89ffa3bd-' failed to start: Failed to fork > executor: Failed to clone child process: Failed to clone: Invalid argument > {code} > the "clone" implementation in the libprocess 3rdparty stout library (in linux.hpp) > wraps the "clone" syscall: > {code:title=clone|borderStyle=solid} > inline pid_t clone(const lambda::function<int()>& func, int flags) > { > // Stack for the child. > // - unsigned long long used for best alignment. > // - 8 MiB appears to be the default for "ulimit -s" on OSX and Linux. > // > // NOTE: We need to allocate the stack dynamically. This is because > // glibc's 'clone' will modify the stack passed to it, therefore the > // stack must NOT be shared as multiple 'clone's can be invoked > // simultaneously. > int stackSize = 8 * 1024 * 1024; > unsigned long long *stack = > new unsigned long long[stackSize/sizeof(unsigned long long)]; > pid_t pid = ::clone( > childMain, > &stack[stackSize/sizeof(stack[0]) - 1], // stack grows down. > flags, > (void*) &func); > // If CLONE_VM is not set, ::clone would create a process which runs in a > // separate copy of the memory space of the calling process. So we destroy the > // stack here to avoid a memory leak. If CLONE_VM is set, ::clone would create a > // thread which runs in the same memory space as the calling process. > if (!(flags & CLONE_VM)) { > delete[] stack; > } > return pid; > } > {code} > the stack parameter passed to the "clone" syscall is only 8-byte aligned, so on an > architecture that mandates a 16-byte-aligned stack (aarch64) it will fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
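An alternative to switching the element type is to request the alignment explicitly. A minimal sketch, assuming POSIX {{posix_memalign}} is acceptable in stout; {{alignedClone}} and its trivial {{childMain}} are hypothetical names, and the real wrapper's lambda plumbing is elided:
{code}
#include <sched.h>       // ::clone, CLONE_VM (glibc exposes these with g++).
#include <sys/types.h>   // pid_t.

#include <cstdlib>       // posix_memalign, free.

// Hypothetical child entry point; the real wrapper trampolines into the
// lambda::function supplied by the caller.
static int childMain(void*) { return 0; }

inline pid_t alignedClone(int flags)
{
  const size_t stackSize = 8 * 1024 * 1024;

  // aarch64 mandates a 16-byte-aligned stack pointer, so request the
  // alignment explicitly instead of relying on the element type.
  void* stack = nullptr;
  if (posix_memalign(&stack, 16, stackSize) != 0) {
    return -1;
  }

  // The stack grows down: pass the address just past the top of the
  // block (stackSize is a multiple of 16, so the top stays aligned).
  pid_t pid = ::clone(
      childMain,
      static_cast<char*>(stack) + stackSize,
      flags,
      nullptr);

  // Without CLONE_VM the child runs in its own copy of the address
  // space, so the parent can free the buffer immediately, mirroring
  // the delete[] in the original wrapper.
  if (!(flags & CLONE_VM)) {
    free(stack);
  }

  return pid;
}
{code}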
[jira] [Updated] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend
[ https://issues.apache.org/jira/browse/MESOS-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Thomas updated MESOS-6004: -- Description: When scheduling more than one task on the same agent, all tasks fail, as containers seem to be destroyed during provisioning. Specifically, the errors in the agent logs are: {noformat} E0808 15:53:09.691315 30996 slave.cpp:3976] Container 'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being destroyed during provisioning {noformat} and {noformat} I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not register within 3mins {noformat} As the default provisioning method {{copy}} is being used, I assume this is due to the provisioning of multiple containers taking too long and the agent not waiting. For large images, this method is simply not performant. The issue did not occur when only one task was scheduled. Increasing the {{executor_registration_timeout}} parameter seemed to help a bit, as it allowed scheduling at least 2 tasks at the same time, but it still fails with more (5 in this case) h4. Complete logs (with GLOG_v=0, as with 1 it was too long) {noformat} Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661067 30961 main.cpp:434] Starting Mesos agent Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661551 30961 slave.cpp:198] Agent started on 1)@172.31.23.17:5051 Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661578 30961 slave.cpp:199] Flags at startup: --appc_simple_discovery_uri_prefix="http://; --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos,docker" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_config="{"auths":{"https:\/\/index.docker.io\/v1\/":{"auth":"dGVycmFsb3VwZTpUYWxFWUFOSXR5","email":"sebastian.ge...@terraloupe.com"}}}" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io; --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/mnt/mesos/store/docker" --do Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: cker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="3mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname="ec2-52-59-113-0.eu-central-1.compute.amazonaws.com" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --image_providers="docker" --image_provisioner_backend="copy" --initialize_driver_logging="true" --isolation="filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://172.31.19.240:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
--perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recov Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: er="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="false" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/mnt/mesos" Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.662147 30961 slave.cpp:519] Agent resources: gpus(*):1; cpus(*):8; mem(*):14014; disk(*):60257; ports(*):[31000-32000] Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.662211 30961 slave.cpp:527] Agent attributes: [ ] Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.662230 30961 slave.cpp:532] Agent hostname: ec2-52-59-113-0.eu-central-1.compute.amazonaws.com Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.663354 31000 state.cpp:57] Recovering state from '/mnt/mesos/meta' Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.663918 30995 status_update_manager.cpp:200] Recovering status update manager Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.664131 30996 containerizer.cpp:522] Recovering
[jira] [Created] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend
Michael Thomas created MESOS-6004: - Summary: Tasks fail when provisioning multiple containers with large docker images using copy backend Key: MESOS-6004 URL: https://issues.apache.org/jira/browse/MESOS-6004 Project: Mesos Issue Type: Bug Components: containerization, docker Affects Versions: 1.0.0 Environment: h4. Agent Platform - Ubuntu 16.04 - AWS g2.x2large instance - Nvidia support enabled h4. Agent Configuration {noformat} --containerizers=mesos,docker --docker_config= --docker_store_dir=/mnt/mesos/store/docker --executor_registration_timeout=3mins --hostname= --image_providers=docker --image_provisioner_backend=copy --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia --switch_user=false --work_dir=/mnt/mesos {noformat} h4. Framework - custom framework written in python - using unified containerizer with docker images h4. Test Setup * 1 master * 1 agent * 5 tasks scheduled at the same time: ** resources: cpus: 0.1, mem: 128 ** command: `echo test` ** docker image: custom docker image, based on nvidia/cuda ~5gb ** the same docker image was used for all tasks and was already pulled. Reporter: Michael Thomas When scheduling more than one task on the same agent, all tasks fail, as containers seem to be destroyed during provisioning. Specifically, the errors in the agent logs are: {noformat} E0808 15:53:09.691315 30996 slave.cpp:3976] Container 'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being destroyed during provisioning {noformat} and {noformat} I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not register within 3mins {noformat} As the default provisioning method `copy` is being used, I assume this is due to the provisioning of multiple containers taking too long and the agent not waiting. For large images, this method is simply not performant. The issue did not occur when only one task was scheduled. Increasing the `executor_registration_timeout` parameter seemed to help a bit, as it allowed scheduling at least 2 tasks at the same time, but it still fails with more (5 in this case) h4. 
Complete logs (with GLOG_v=0, as with 1 it was too long) {noformat} Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661067 30961 main.cpp:434] Starting Mesos agent Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661551 30961 slave.cpp:198] Agent started on 1)@172.31.23.17:5051 Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661578 30961 slave.cpp:199] Flags at startup: --appc_simple_discovery_uri_prefix="http://; --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos,docker" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_config="{"auths":{"https:\/\/index.docker.io\/v1\/":{"auth":"dGVycmFsb3VwZTpUYWxFWUFOSXR5","email":"sebastian.ge...@terraloupe.com"}}}" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io; --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/mnt/mesos/store/docker" --do Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: cker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="3mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname="ec2-52-59-113-0.eu-central-1.compute.amazonaws.com" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --image_providers="docker" --image_provisioner_backend="copy" --initialize_driver_logging="true" --isolation="filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://172.31.19.240:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recov Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: er="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true"
[jira] [Updated] (MESOS-6003) Add logging module for logging to an external program
[ https://issues.apache.org/jira/browse/MESOS-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-6003: -- Assignee: Will Rouesnel > Add logging module for logging to an external program > - > > Key: MESOS-6003 > URL: https://issues.apache.org/jira/browse/MESOS-6003 > Project: Mesos > Issue Type: Improvement > Components: modules >Reporter: Will Rouesnel >Assignee: Will Rouesnel >Priority: Minor > > In the vein of the logrotate module for logging, there should be a similar > module which provides support for logging to an arbitrary log handling > program, with suitable task metadata provided by environment variables or > command line arguments. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5028) Copy provisioner cannot replace directory with symlink
[ https://issues.apache.org/jira/browse/MESOS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412053#comment-15412053 ] Zhitao Li commented on MESOS-5028: -- One thing I forgot to mention is that I did a {{docker save}} to a tar file, and used the local registry store option when performing the test. The problematic layer I generated does not have an extra whiteout file in such a case: {quote} zhitao@zhitao-mesos1:~/mesos/build$ ls -alR /t/layers/90e46350e512b827e8fe73a053ededc13f7eb1bccca96dc8ef86d6a6cd98f29c/rootfs/ /t/layers/90e46350e512b827e8fe73a053ededc13f7eb1bccca96dc8ef86d6a6cd98f29c/rootfs/: total 12 drwxr-xr-x 3 root root 4096 Aug 8 16:36 . drwxr-xr-x 3 root root 4096 Aug 8 16:36 .. drwxrwxr-x 2 root root 4096 Aug 5 20:01 etc /t/layers/90e46350e512b827e8fe73a053ededc13f7eb1bccca96dc8ef86d6a6cd98f29c/rootfs/etc: total 8 drwxrwxr-x 2 root root 4096 Aug 5 20:01 . drwxr-xr-x 3 root root 4096 Aug 8 16:36 .. lrwxrwxrwx 1 root root 4 Aug 5 20:01 cirros -> /tmp {quote} > Copy provisioner cannot replace directory with symlink > -- > > Key: MESOS-5028 > URL: https://issues.apache.org/jira/browse/MESOS-5028 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Zhitao Li >Assignee: Gilbert Song > > I'm trying to play with the new image provisioner on our custom docker > images, but one of the layers failed to get copied, possibly due to a dangling > symlink. > Error log with Glog_v=1: > {quote} > I0324 05:42:48.926678 15067 copy.cpp:127] Copying layer path > '/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs' > to rootfs > '/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6' > E0324 05:42:49.028506 15062 slave.cpp:3773] Container > '5f05be6c-c970-4539-aa64-fd0eef2ec7ae' for executor 'test' of framework > 75932a89-1514-4011-bafe-beb6a208bb2d-0004 failed to start: Collect failed: > Collect failed: Failed to copy layer: cp: cannot overwrite directory > ‘/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6/etc/apt’ > with non-directory > {quote} > The content of > _/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs/etc/apt_ > points to a non-existing absolute path (I cannot provide the exact path, but it's a > result of us trying to mount apt keys into the docker container at build time). > I believe what happened is that we executed a script at build time which > contains the equivalent of: > {quote} > rm -rf /etc/apt/* && ln -sf /build-mount-point/ /etc/apt > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
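For what it's worth, the {{cp}} behavior underlying the original failure is easy to reproduce outside Mesos; a minimal sketch with placeholder paths:
{noformat}
mkdir -p lower/etc/apt                   # base layer: /etc/apt is a directory
mkdir -p upper/etc
ln -s /build-mount-point upper/etc/apt   # next layer replaces it with a symlink
cp -a upper/. lower/
# cp: cannot overwrite directory 'lower/etc/apt' with non-directory
{noformat}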
[jira] [Commented] (MESOS-5830) Make a sweep to trim excess space around angle brackets
[ https://issues.apache.org/jira/browse/MESOS-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412052#comment-15412052 ] Gaojin CAO commented on MESOS-5830: --- Sure, done! > Make a sweep to trim excess space around angle brackets > --- > > Key: MESOS-5830 > URL: https://issues.apache.org/jira/browse/MESOS-5830 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Bannier >Priority: Trivial > Labels: mesosphere, newbie > > The codebase still has pre-C++11 code where we needed to say e.g., > {{vector
[jira] [Commented] (MESOS-4577) libprocess can not run on 16-byte aligned stack mandatory architecture(aarch64)
[ https://issues.apache.org/jira/browse/MESOS-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15411969#comment-15411969 ] gtin commented on MESOS-4577: - It seems this issue was fixed in the latest kernel, 4.7, which no longer enforces 16-byte alignment: https://github.com/torvalds/linux/blob/v4.7/arch/arm64/kernel/process.c https://patchwork.codeaurora.org/patch/13893/ It would be nice to have a workaround for those of us stuck on old kernels. > libprocess can not run on 16-byte aligned stack mandatory > architecture(aarch64) > > > Key: MESOS-4577 > URL: https://issues.apache.org/jira/browse/MESOS-4577 > Project: Mesos > Issue Type: Bug > Components: libprocess, stout > Environment: Linux 10-175-112-202 4.1.6-rc3.aarch64 #1 SMP Mon Oct 12 > 01:43:03 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux >Reporter: AndyPang >Assignee: AndyPang > Labels: mesosphere > > Running Mesos on AArch64 produces an error; the log is: > {code} > E0101 00:06:56.636520 32411 slave.cpp:3342] Container > 'b6be429a-08f0-4d52-b01d-abfcb6e0106b' for executor > 'hello.84d205ae-f626-11de-bd66-7a3f6cf980b9' of framework > '868b9f04-9179-427b-b050-ee8f89ffa3bd-' failed to start: Failed to fork > executor: Failed to clone child process: Failed to clone: Invalid argument > {code} > the "clone" implementation in the libprocess 3rdparty stout library (in linux.hpp) > wraps the "clone" syscall: > {code:title=clone|borderStyle=solid} > inline pid_t clone(const lambda::function<int()>& func, int flags) > { > // Stack for the child. > // - unsigned long long used for best alignment. > // - 8 MiB appears to be the default for "ulimit -s" on OSX and Linux. > // > // NOTE: We need to allocate the stack dynamically. This is because > // glibc's 'clone' will modify the stack passed to it, therefore the > // stack must NOT be shared as multiple 'clone's can be invoked > // simultaneously. > int stackSize = 8 * 1024 * 1024; > unsigned long long *stack = > new unsigned long long[stackSize/sizeof(unsigned long long)]; > pid_t pid = ::clone( > childMain, > &stack[stackSize/sizeof(stack[0]) - 1], // stack grows down. > flags, > (void*) &func); > // If CLONE_VM is not set, ::clone would create a process which runs in a > // separate copy of the memory space of the calling process. So we destroy the > // stack here to avoid a memory leak. If CLONE_VM is set, ::clone would create a > // thread which runs in the same memory space as the calling process. > if (!(flags & CLONE_VM)) { > delete[] stack; > } > return pid; > } > {code} > the stack parameter passed to the "clone" syscall is only 8-byte aligned, so on an > architecture that mandates a 16-byte-aligned stack (aarch64) it will fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5830) Make a sweep to trim excess space around angle brackets
[ https://issues.apache.org/jira/browse/MESOS-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-5830: Labels: mesosphere newbie (was: ) > Make a sweep to trim excess space around angle brackets > --- > > Key: MESOS-5830 > URL: https://issues.apache.org/jira/browse/MESOS-5830 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Bannier >Priority: Trivial > Labels: mesosphere, newbie > > The codebase still has pre-C++11 code where we needed to say e.g., > {{vector
[jira] [Commented] (MESOS-5830) Make a sweep to trim excess space around angle brackets
[ https://issues.apache.org/jira/browse/MESOS-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15411797#comment-15411797 ] Benjamin Bannier commented on MESOS-5830: - [~zerobleed] I see you have already posted a patch (https://reviews.apache.org/r/50887/). Could you please first get yourself added as a contributor so you could then assign this ticket to yourself? After that you could post a link to the review and move this ticket to a reviewable state. > Make a sweep to trim excess space around angle brackets > --- > > Key: MESOS-5830 > URL: https://issues.apache.org/jira/browse/MESOS-5830 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Bannier >Priority: Trivial > > The codebase still has pre-C++11 code where we needed to say e.g., > {{vector
[jira] [Commented] (MESOS-5536) Completed executors presented as alive
[ https://issues.apache.org/jira/browse/MESOS-5536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15411757#comment-15411757 ] Tomasz Janiszewski commented on MESOS-5536: --- After updating to 0.28.2, completed executors still show up. I'll delete them manually and monitor whether new ones appear. > Completed executors presented as alive > -- > > Key: MESOS-5536 > URL: https://issues.apache.org/jira/browse/MESOS-5536 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.28.0 > Environment: Ubuntu 14.04.3 LTS >Reporter: Tomasz Janiszewski > > I'm running Mesos 0.28.0. The Mesos {{slave(1)/state}} endpoint returns some > completed executors not in frameworks.completed_executors but in > frameworks.executors. These executors are also present in {{monitor/statistics}} > {code:JavaScript:title=slave(1)/state} > { > "attributes": {...}, > "completed_frameworks": [], > "flags": {...}, > "frameworks": [ > { > "checkpoint": true, > "completed_executors": [...], > "executors": [ > { > "queued_tasks": [], > "tasks": [], > "completed_tasks": [ > { > "discovery": {...}, > "executor_id": "", > "framework_id": > "f65b163c-0faf-441f-ac14-91739fa4394c-", > "id": > "service.a3b609b8-27ec-11e6-8044-02c89eb9127e", > "labels": [...], > "name": "service", > "resources": {...}, > "slave_id": > "ef232fd9-5114-4d8f-adc3-1669c1e6fdc5-S13", > "state": "TASK_KILLED", > "statuses": [] > } > ], > "container": "ead42e63-ac92-4ad0-a99c-4af9c3fa5e31", > "directory": "...", > "id": "service.a3b609b8-27ec-11e6-8044-02c89eb9127e", > "name": "Command Executor (Task: > service.a3b609b8-27ec-11e6-8044-02c89eb9127e) (Command: sh -c 'cd > service...')", > "resources": {...}, > "source": "service.a3b609b8-27ec-11e6-8044-02c89eb9127e" > > }, > ... > ], > } > ], > "git_sha": "961edbd82e691a619a4c171a7aadc9c32957fa73", > "git_tag": "0.28.0", > "version": "0.28.0", > ... > } > {code} > {code:title="var/log/mesos/mesos-slave.INFO"} > 13:33:19.479182 [slave.cpp:1361] Got assigned task > service.a3b609b8-27ec-11e6-8044-02c89eb9127e for framework > f65b163c-0faf-441f-ac14-91739fa4394c- > 13:33:19.482566 [slave.cpp:1480] Launching task > service.a3b609b8-27ec-11e6-8044-02c89eb9127e for framework > f65b163c-0faf-441f-ac14-91739fa4394c- > 13:33:19.483921 [paths.cpp:528] Trying to chown > '/tmp/mesos/slaves/ef232fd9-5114-4d8f-adc3-1669c1e6fdc5-S13/frameworks/f65b163c-0faf-441f-ac14-91739fa4394c-/executors/service.a3b609b8-27ec-11e6-8044-02c89eb9127e/runs/ead42e63-ac92-4ad0-a99c-4af9c3fa5e31' > to user 'mesosuser' > 13:33:19.504173 [slave.cpp:5367] Launching executor > service.a3b609b8-27ec-11e6-8044-02c89eb9127e of framework > f65b163c-0faf-441f-ac14-91739fa4394c- with resources cpus(*):0.1; > mem(*):32 in work directory > '/tmp/mesos/slaves/ef232fd9-5114-4d8f-adc3-1669c1e6fdc5-S13/frameworks/f65b163c-0faf-441f-ac14-91739fa4394c-/executors/service.a3b609b8-27ec-11e6-8044-02c89eb9127e/runs/ead42e63-ac92-4ad0-a99c-4af9c3fa5e31' > 13:33:19.505537 [containerizer.cpp:666] Starting container > 'ead42e63-ac92-4ad0-a99c-4af9c3fa5e31' for executor > 'service.a3b609b8-27ec-11e6-8044-02c89eb9127e' of framework > 'f65b163c-0faf-441f-ac14-91739fa4394c-' > 13:33:19.505734 [slave.cpp:1698] Queuing task > 'service.a3b609b8-27ec-11e6-8044-02c89eb9127e' for executor > 'service.a3b609b8-27ec-11e6-8044-02c89eb9127e' of framework > f65b163c-0faf-441f-ac14-91739fa4394c- > ... 
> 13:33:19.977483 [containerizer.cpp:1118] Checkpointing executor's forked pid > 25576 to > '/tmp/mesos/meta/slaves/ef232fd9-5114-4d8f-adc3-1669c1e6fdc5-S13/frameworks/f65b163c-0faf-441f-ac14-91739fa4394c-/executors/service.a3b609b8-27ec-11e6-8044-02c89eb9127e/runs/ead42e63-ac92-4ad0-a99c-4af9c3fa5e31/pids/forked.pid' > 13:33:35.775195 [slave.cpp:1891] Asked to kill task > service.a3b609b8-27ec-11e6-8044-02c89eb9127e of framework > f65b163c-0faf-441f-ac14-91739fa4394c- > 13:33:35.775645 [slave.cpp:3002] Handling status update TASK_KILLED (UUID: > eba64915-7df2-483d-8982-a9a46a48a81b) for task > service.a3b609b8-27ec-11e6-8044-02c89eb9127e of framework > f65b163c-0faf-441f-ac14-91739fa4394c- from @0.0.0.0:0
[jira] [Updated] (MESOS-5987) Update health check protobuf for HTTP and TCP health check
[ https://issues.apache.org/jira/browse/MESOS-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-5987: --- Shepherd: Alexander Rukletsov Sprint: Mesosphere Sprint 40 Story Points: 3 > Update health check protobuf for HTTP and TCP health check > -- > > Key: MESOS-5987 > URL: https://issues.apache.org/jira/browse/MESOS-5987 > Project: Mesos > Issue Type: Task >Reporter: haosdent >Assignee: haosdent > Labels: health-check, mesosphere > Fix For: 1.1.0 > > > To support HTTP and TCP health checks, we need to update the existing > {{HealthCheck}} protobuf message according to what [~alexr] and [~gaston] > commented in https://reviews.apache.org/r/36816/ and > https://reviews.apache.org/r/49360/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3325) Running mesos-slave@0.23 in a container causes slave to be lost after a restart
[ https://issues.apache.org/jira/browse/MESOS-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15411550#comment-15411550 ] Lei Xu commented on MESOS-3325: --- Hi, we hit this issue months ago: the mesos agent always reads the boot_id from the host OS, re-generates the slave id, and registers with the master as a new agent. I remember there is an issue tracking this, but I forget the issue id. You can give a fixed boot id to the agent to make sure the slave id does not change on restart. > Running mesos-slave@0.23 in a container causes slave to be lost after a > restart > --- > > Key: MESOS-3325 > URL: https://issues.apache.org/jira/browse/MESOS-3325 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 0.23.0 > Environment: CoreOS, Container, Docker >Reporter: Chris Fortier >Priority: Critical > > We are attempting to run mesos-slave 0.23 in a container. However, it appears > that the mesos-slave agent registers as a new slave instead of > re-registering. This causes the formerly-launched tasks to continue running. > systemd unit being used: > ``` > [Unit] > Description=MesosSlave > After=docker.service dockercfg.service > Requires=docker.service dockercfg.service > [Service] > Environment=MESOS_IMAGE=mesosphere/mesos-slave:0.23.0-1.0.ubuntu1404 > Environment=ZOOKEEPER=redacted > User=core > KillMode=process > Restart=always > RestartSec=20 > TimeoutStartSec=0 > ExecStartPre=-/usr/bin/docker kill mesos_slave > ExecStartPre=-/usr/bin/docker rm mesos_slave > ExecStartPre=/usr/bin/docker pull ${MESOS_IMAGE} > ExecStart=/usr/bin/sh -c "sudo /usr/bin/docker run \ > --name=mesos_slave \ > --net=host \ > --pid=host \ > --privileged \ > -v /home/core/.dockercfg:/root/.dockercfg:ro \ > -v /sys:/sys \ > -v /usr/bin/docker:/usr/bin/docker:ro \ > -v /var/run/docker.sock:/var/run/docker.sock \ > -v /lib64/libdevmapper.so.1.02:/lib/libdevmapper.so.1.02:ro \ > -v /var/lib/mesos/slave:/var/lib/mesos/slave \ > ${MESOS_IMAGE} \ > --ip=`curl -s http://169.254.169.254/latest/meta-data/local-ipv4` \ > --attributes=zone:$(curl -s > http://169.254.169.254/latest/meta-data/placement/availability-zone)\;os:coreos > \ > --containerizers=docker,mesos \ > --executor_registration_timeout=10mins \ > --hostname=`curl -s > http://169.254.169.254/latest/meta-data/public-hostname` \ > --log_dir=/var/log/mesos \ > --master=zk://${ZOOKEEPER}/mesos \ > --work_dir=/var/lib/mesos/slave" > ExecStop=/usr/bin/docker stop mesos_slave > [Install] > WantedBy=multi-user.target > [X-Fleet] > Global=true > MachineMetadata=role=worker > ``` > PS: yes, I saw the coreos-setup repo was deprecated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4440) Clean up get/post/deleteRequest functions and let the caller use the general function.
[ https://issues.apache.org/jira/browse/MESOS-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15411362#comment-15411362 ] Yongqiao Wang commented on MESOS-4440: -- [~adam-mesos] I plan to clean up the code described in this ticket; do you have time to give me a review? I will submit patches later. > Clean up get/post/deleteRequest functions and let the caller use the general > function. > > > Key: MESOS-4440 > URL: https://issues.apache.org/jira/browse/MESOS-4440 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Yongqiao Wang >Assignee: Yongqiao Wang >Priority: Minor > Labels: tech-debt > -- This message was sent by Atlassian JIRA (v6.3.4#6332)