[jira] [Assigned] (MESOS-5078) Document TaskStatus reasons
[ https://issues.apache.org/jira/browse/MESOS-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers reassigned MESOS-5078: -- Assignee: Benno Evers > Document TaskStatus reasons > --- > > Key: MESOS-5078 > URL: https://issues.apache.org/jira/browse/MESOS-5078 > Project: Mesos > Issue Type: Documentation > Components: documentation >Reporter: Greg Mann >Assignee: Benno Evers > Labels: documentation, mesosphere, newbie++ > > We should document the possible {{reason}} values that can be found in the > {{TaskStatus}} message. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-5078) Document TaskStatus reasons
[ https://issues.apache.org/jira/browse/MESOS-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers updated MESOS-5078: --- Sprint: Mesosphere Sprint 61 > Document TaskStatus reasons > --- > > Key: MESOS-5078 > URL: https://issues.apache.org/jira/browse/MESOS-5078 > Project: Mesos > Issue Type: Documentation > Components: documentation >Reporter: Greg Mann >Assignee: Benno Evers > Labels: documentation, mesosphere, newbie++ > > We should document the possible {{reason}} values that can be found in the > {{TaskStatus}} message. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7876) Investigate jemalloc as a possible malloc for mesos
Benno Evers created MESOS-7876: -- Summary: Investigate jemalloc as a possible malloc for mesos Key: MESOS-7876 URL: https://issues.apache.org/jira/browse/MESOS-7876 Project: Mesos Issue Type: Improvement Reporter: Benno Evers Assignee: Benno Evers It is currently very hard to debug memory issues, in particular memory leaks, in mesos. An alluring way to improve the situation would be to change the default malloc to jemalloc, which has built-in heap-tracking capabilities. However, some care needs to be taken when considering a change to such a fundamental part of mesos: * Would such a switch have any adverse impact on performance? * Is it available and will it compile on all our target platforms? * Is the jemalloc license compatible with bundling it as a third-party library?
[jira] [Commented] (MESOS-5078) Document TaskStatus reasons
[ https://issues.apache.org/jira/browse/MESOS-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16123313#comment-16123313 ] Benno Evers commented on MESOS-5078: Review: https://reviews.apache.org/r/61495/ > Document TaskStatus reasons > --- > > Key: MESOS-5078 > URL: https://issues.apache.org/jira/browse/MESOS-5078 > Project: Mesos > Issue Type: Documentation > Components: documentation >Reporter: Greg Mann >Assignee: Benno Evers > Labels: documentation, mesosphere, newbie++ > > We should document the possible {{reason}} values that can be found in the > {{TaskStatus}} message. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7773) HTTP request validation stage is not explicit.
[ https://issues.apache.org/jira/browse/MESOS-7773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125532#comment-16125532 ] Benno Evers commented on MESOS-7773: While we're at it, we should also make sure that we always return BadRequest on malformed user input instead of `CHECK`-ing and aborting. Right now, there are some places where it looks like we're asserting certain properties of user-passed protobuf messages; for example, the local authorizer seems to `CHECK` that certain fields of the passed protobuf message were set (src/authorizer/local/authorizer.cpp:312). > HTTP request validation stage is not explicit. > -- > > Key: MESOS-7773 > URL: https://issues.apache.org/jira/browse/MESOS-7773 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Alexander Rukletsov > Labels: mesosphere, reliability > > Currently we validate HTTP requests in multiple places in libprocess, for > instance {{ProcessManager::handle()}}, {{StreamingRequestDecoder::decode()}}, > {{process::parse()}}. To improve error handling when dealing with malformed > HTTP requests (including libprocess messages), consider introducing a > validation stage and / or make sure {{Request}} and all its components are in > valid state before we start using it.
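The idea in the comment above, rejecting malformed input with an error response rather than aborting the process, can be illustrated with a minimal sketch. It is written in Python for brevity, not in the C++ used by Mesos, and the function name and field names are hypothetical; they are not taken from the authorizer code.

```python
from http import HTTPStatus

# Hypothetical required fields; the real protobuf message may differ.
REQUIRED_FIELDS = ("subject", "action")


def validate_request(message: dict):
    """Return (status, error_text) instead of asserting on user input."""
    missing = [field for field in REQUIRED_FIELDS if field not in message]
    if missing:
        # Malformed user input is the caller's fault: report it, don't abort.
        return HTTPStatus.BAD_REQUEST, "Missing required field(s): " + ", ".join(missing)
    return HTTPStatus.OK, ""
```

Assertions would then be reserved for invariants that the program itself is responsible for, never for properties of data supplied by a client.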
[jira] [Updated] (MESOS-7876) Investigate alternative malloc's for mesos
[ https://issues.apache.org/jira/browse/MESOS-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers updated MESOS-7876: --- Summary: Investigate alternative malloc's for mesos (was: Investigate jemalloc as a possible malloc for mesos) > Investigate alternative malloc's for mesos > -- > > Key: MESOS-7876 > URL: https://issues.apache.org/jira/browse/MESOS-7876 > Project: Mesos > Issue Type: Improvement >Reporter: Benno Evers >Assignee: Benno Evers > > It is currently very hard to debug memory issues, in particular memory leaks, > in mesos. > An alluring way to improve the situation would be to change the default > malloc to jemalloc, which has built-in heap-tracking capabilities. > However, some care needs to be taken when considering to change such a > fundamental part of mesos: > * Would such a switch have any adverse impact on performance? > * Is it available and will it compile on all our target platforms? > * Is the jemalloc-licensing compatible with bundling as third-party library? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7876) Investigate alternative malloc's for mesos
[ https://issues.apache.org/jira/browse/MESOS-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127194#comment-16127194 ] Benno Evers commented on MESOS-7876: Licensing: jemalloc is under the 2-clause BSD license, so there should be no problem. Availability: jemalloc uses a standard autotools-based build, so adding it to our build should be no problem. As far as I know, mesos allocates all memory using operator new, which is a standard interface, so there should be no platform-specific problems. Performance: To test malloc performance, I compiled two versions of jemalloc 4.5.0 with the default configuration options used in FreeBSD ( [https://www.freebsd.org/cgi/man.cgi?jemalloc(3)] ), i.e. `--enable-fill --enable-lazy-lock --enable-munmap --enable-tcache --enable-tls --enable-utrace --enable-xmalloc`. For one of them, I additionally specified the flags `--enable-stats --enable-prof` to enable heap statistics gathering and profiling options; for the other I specified `--disable-stats --disable-prof`. Next, I spawned n threads per allocator (i.e. 3*n threads in total) and had each thread do 125,000 allocation and deallocation operations with memory regions uniformly distributed between 1 byte and 64 MiB. All three allocators were running at the same time to ensure the system base load was the same for all of them. !noprof.png|Results run 1! !prof.png|Results run 2! More or less as predicted by other people's experience, these results show that the heap tracking functionality has almost no runtime impact when enabled but not actively used, and as a bonus jemalloc actually seems to provide a substantial speedup for multi-threaded allocations, although it's debatable whether this will be noticeable during normal operation. I didn't manage to get a clean measurement from mesos' own benchmark tests yet.
This post by Facebook describes some implementation details of jemalloc, along with a very extensive comparison of several malloc implementations, although it seems the actual results are missing from the page: https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919/ > Investigate alternative malloc's for mesos > -- > > Key: MESOS-7876 > URL: https://issues.apache.org/jira/browse/MESOS-7876 > Project: Mesos > Issue Type: Improvement >Reporter: Benno Evers >Assignee: Benno Evers > Attachments: jemalloc_benchmark_raw.txt, malloc.cpp, noprof.png, > prof.png > > > It is currently very hard to debug memory issues, in particular memory leaks, > in mesos. > An alluring way to improve the situation would be to change the default > malloc to jemalloc, which has built-in heap-tracking capabilities. > However, some care needs to be taken when considering to change such a > fundamental part of mesos: > * Would such a switch have any adverse impact on performance? > * Is it available and will it compile on all our target platforms? > * Is the jemalloc-licensing compatible with bundling as third-party library? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7876) Investigate alternative malloc's for mesos
[ https://issues.apache.org/jira/browse/MESOS-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers updated MESOS-7876: --- Attachment: noprof.png prof.png malloc.cpp jemalloc_benchmark_raw.txt > Investigate alternative malloc's for mesos > -- > > Key: MESOS-7876 > URL: https://issues.apache.org/jira/browse/MESOS-7876 > Project: Mesos > Issue Type: Improvement >Reporter: Benno Evers >Assignee: Benno Evers > Attachments: jemalloc_benchmark_raw.txt, malloc.cpp, noprof.png, > prof.png > > > It is currently very hard to debug memory issues, in particular memory leaks, > in mesos. > An alluring way to improve the situation would be to change the default > malloc to jemalloc, which has built-in heap-tracking capabilities. > However, some care needs to be taken when considering to change such a > fundamental part of mesos: > * Would such a switch have any adverse impact on performance? > * Is it available and will it compile on all our target platforms? > * Is the jemalloc-licensing compatible with bundling as third-party library? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7876) Investigate alternative malloc's for mesos
[ https://issues.apache.org/jira/browse/MESOS-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127205#comment-16127205 ] Benno Evers commented on MESOS-7876: One other option we should probably keep in mind is _tcmalloc_, another malloc implementation, created at Google, that makes many of the same promises as `jemalloc` (i.e. drastically faster allocation times and optional heap tracking support) and is already included in our dependencies because it is part of {{gperftools}}. On the one hand, this would avoid adding an additional dependency. On the other hand, it could also lead to additional problems, because some other third-party dependencies also try to link against tcmalloc if it is available at build time, so we might end up using several different versions of it if the bundled version differs from the one installed on the system and one of the build systems involved doesn't handle this situation correctly. > Investigate alternative malloc's for mesos > -- > > Key: MESOS-7876 > URL: https://issues.apache.org/jira/browse/MESOS-7876 > Project: Mesos > Issue Type: Improvement >Reporter: Benno Evers >Assignee: Benno Evers > Attachments: jemalloc_benchmark_raw.txt, malloc.cpp, noprof.png, > prof.png > > > It is currently very hard to debug memory issues, in particular memory leaks, > in mesos. > An alluring way to improve the situation would be to change the default > malloc to jemalloc, which has built-in heap-tracking capabilities. > However, some care needs to be taken when considering to change such a > fundamental part of mesos: > * Would such a switch have any adverse impact on performance? > * Is it available and will it compile on all our target platforms? > * Is the jemalloc-licensing compatible with bundling as third-party library?
[jira] [Comment Edited] (MESOS-7876) Investigate alternative malloc's for mesos
[ https://issues.apache.org/jira/browse/MESOS-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127194#comment-16127194 ] Benno Evers edited comment on MESOS-7876 at 8/15/17 1:24 PM: - Licensing: jemalloc is under the 2-clause BSD license, so there should be no problem. Availability: jemalloc uses a standard autotools-based build, so adding it to our build should be no problem. As far as I know, mesos allocates all memory using operator new, which is a standard interface, so there should be no platform-specific problems. Performance: To test malloc performance, I compiled two versions of jemalloc 4.5.0 with the default configuration options used in FreeBSD ( [https://www.freebsd.org/cgi/man.cgi?jemalloc(3)] ), i.e. {{--enable-fill --enable-lazy-lock --enable-munmap --enable-tcache --enable-tls --enable-utrace --enable-xmalloc}}. For one of them, I additionally specified the flags `--enable-stats --enable-prof` to enable heap statistics gathering and profiling options; for the other I specified `--disable-stats --disable-prof`. Next, I spawned n threads per allocator (i.e. 3*n threads in total) and had each thread do 125,000 allocation and deallocation operations with memory regions uniformly distributed between 1 byte and 64 MiB. All three allocators were running at the same time to ensure the system base load was the same for all of them. !noprof.png|Results run 1! !prof.png|Results run 2! More or less as predicted by other people's experience, these results show that the heap tracking functionality has almost no runtime impact when enabled but not actively used, and as a bonus jemalloc actually seems to provide a substantial speedup for multi-threaded allocations, although it's debatable whether this will be noticeable during normal operation. I didn't manage to get a clean measurement from mesos' own benchmark tests yet.
This post by Facebook describes some implementation details of jemalloc, along with a very extensive comparison of several malloc implementations, although it seems the actual results are missing from the page: https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919/ > Investigate alternative malloc's for mesos > -- > > Key: MESOS-7876 > URL: https://issues.apache.org/jira/browse/MESOS-7876 > Project: Mesos > Issue Type: Improvement >Reporter: Benno Evers >Assignee: Benno Evers > Attachments: jemalloc_benchmark_raw.txt, malloc.cpp, noprof.png, > prof.png > > > It is currently very hard to debug memory issues, in particular memory leaks, > in mesos. > An alluring way to improve the situation would be to change the default > malloc to jemalloc, which has built-in heap-tracking capabilities. > However, some care needs to be taken when considering to change such
[jira] [Commented] (MESOS-7819) Libprocess internal state is not monitored by metrics.
[ https://issues.apache.org/jira/browse/MESOS-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127362#comment-16127362 ] Benno Evers commented on MESOS-7819: For the metrics that we think might occasionally be useful for debugging but where we are worried about exposing too much internal state (points 1, 2, 5), another idea would be to introduce something like private metrics, which would essentially be a {{volatile static int64_t}} (so all modifications are preserved even at high optimization levels, but the only way to actually see the value would be through a debugger). Some thoughts about the individual proposed metrics: it seems to me that any single one wouldn't be very useful, because it's hard to say in isolation how many actors/connections/messages are "normal" for the different parts of mesos. With several of them, however, it would become possible to compare their ratios to known "normal" ranges and maybe pinpoint the fault location more precisely. In particular, the average number of pending messages might be useful not only for debugging but also for performance regression tests in the future. > Libprocess internal state is not monitored by metrics. > -- > > Key: MESOS-7819 > URL: https://issues.apache.org/jira/browse/MESOS-7819 > Project: Mesos > Issue Type: Improvement > Components: libprocess >Reporter: Alexander Rukletsov > Labels: metrics, newbie++ > > Libprocess does not expose its internal state via metrics. Active sockets, > number of HTTP proxies, number of running actors, number of pending messages > for all active sockets, etc. may be of interest when monitoring and > debugging Mesos clusters.
[jira] [Updated] (MESOS-7876) Investigate alternative malloc implementations for mesos
[ https://issues.apache.org/jira/browse/MESOS-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers updated MESOS-7876: --- Summary: Investigate alternative malloc implementations for mesos (was: Investigate alternative malloc's for mesos) > Investigate alternative malloc implementations for mesos > > > Key: MESOS-7876 > URL: https://issues.apache.org/jira/browse/MESOS-7876 > Project: Mesos > Issue Type: Improvement >Reporter: Benno Evers >Assignee: Benno Evers > Attachments: jemalloc_benchmark_raw.txt, malloc.cpp, noprof.png, > prof.png > > > It is currently very hard to debug memory issues, in particular memory leaks, > in mesos. > An alluring way to improve the situation would be to change the default > malloc to jemalloc, which has built-in heap-tracking capabilities. > However, some care needs to be taken when considering to change such a > fundamental part of mesos: > * Would such a switch have any adverse impact on performance? > * Is it available and will it compile on all our target platforms? > * Is the jemalloc-licensing compatible with bundling as third-party library? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7876) Investigate alternative malloc implementations for mesos
[ https://issues.apache.org/jira/browse/MESOS-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127588#comment-16127588 ] Benno Evers commented on MESOS-7876: Spot-checking some of the mesos benchmarks using jemalloc vs. system malloc, I can observe a small speedup of roughly 1% to 7% using jemalloc over glibc in all but one case (the short SchedulerLibrary/0 run was about 7% slower, a difference of 21 ms). There certainly is no indication that switching to jemalloc would lead to performance regressions.
With jemalloc:
{code}
[ OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.DeclineOffers/1 (575213 ms)
[ OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.Metrics/1 (1963 ms)
[ OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.Metrics/10 (18756 ms)
[ OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.Metrics/11 (37044 ms)
[ OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.Metrics/12 (97298 ms)
[ OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/0 (302 ms)
[ OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/1 (2311 ms)
[ OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/2 (12104 ms)
{code}
With default malloc:
{code}
[ OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.DeclineOffers/1 (610002 ms)
[ OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.Metrics/1 (2065 ms)
[ OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.Metrics/10 (20207 ms)
[ OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.Metrics/11 (38086 ms)
[ OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.Metrics/12 (98475 ms)
[ OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/0 (281 ms)
[ OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/1 (2448 ms)
[ OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/2 (12673 ms)
{code}
> Investigate alternative malloc implementations for mesos > > > Key: MESOS-7876 > 
URL: https://issues.apache.org/jira/browse/MESOS-7876 > Project: Mesos > Issue Type: Improvement >Reporter: Benno Evers >Assignee: Benno Evers > Attachments: jemalloc_benchmark_raw.txt, malloc.cpp, noprof.png, > prof.png > > > It is currently very hard to debug memory issues, in particular memory leaks, > in mesos. > An alluring way to improve the situation would be to change the default > malloc to jemalloc, which has built-in heap-tracking capabilities. > However, some care needs to be taken when considering to change such a > fundamental part of mesos: > * Would such a switch have any adverse impact on performance? > * Is it available and will it compile on all our target platforms? > * Is the jemalloc-licensing compatible with bundling as third-party library? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
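The relative differences between the two benchmark listings in the comment above can be recomputed directly from the reported wall-clock times. The following is a minimal sketch; the shortened test names and the helper `relative_change` are illustrative choices, while the numbers are copied from the listings.

```python
# Wall-clock times in milliseconds, copied from the two benchmark listings.
jemalloc_ms = {
    "HierarchicalAllocator.DeclineOffers/1": 575213,
    "HierarchicalAllocator.Metrics/1": 1963,
    "HierarchicalAllocator.Metrics/10": 18756,
    "HierarchicalAllocator.Metrics/11": 37044,
    "HierarchicalAllocator.Metrics/12": 97298,
    "SchedulerReconcileTasks.SchedulerLibrary/0": 302,
    "SchedulerReconcileTasks.SchedulerLibrary/1": 2311,
    "SchedulerReconcileTasks.SchedulerLibrary/2": 12104,
}
default_ms = {
    "HierarchicalAllocator.DeclineOffers/1": 610002,
    "HierarchicalAllocator.Metrics/1": 2065,
    "HierarchicalAllocator.Metrics/10": 20207,
    "HierarchicalAllocator.Metrics/11": 38086,
    "HierarchicalAllocator.Metrics/12": 98475,
    "SchedulerReconcileTasks.SchedulerLibrary/0": 281,
    "SchedulerReconcileTasks.SchedulerLibrary/1": 2448,
    "SchedulerReconcileTasks.SchedulerLibrary/2": 12673,
}


def relative_change(default: float, jemalloc: float) -> float:
    """Percent of run time saved with jemalloc (negative means slower)."""
    return 100.0 * (default - jemalloc) / default


changes = {
    name: relative_change(default_ms[name], jemalloc_ms[name])
    for name in jemalloc_ms
}

for name, percent in changes.items():
    print(f"{name:45s} {percent:+6.1f}%")
```

Each figure comes from a single run per configuration, so small differences, especially on the sub-second tests, may well be within measurement noise.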
[jira] [Commented] (MESOS-7699) "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable freshly released)
[ https://issues.apache.org/jira/browse/MESOS-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147024#comment-16147024 ] Benno Evers commented on MESOS-7699: I also experienced this, and I think the correct way to handle it is to revert the usage of `-isystem` back to `-I`, and then to either disable building with `-Werror` by default (my preferred choice) or to add `-Wno-deprecated-declarations` to the default build flags. The reasoning here is that using `-Werror` implies that we're committing to fix at least all warnings that occur with our supported list of compilers and dependencies, but as the original boost bug showed we are not willing and don't have the resources to do that. (I think the fact that we have a `--disable-werror` configure flag also shows that this would be a useful thing to do) Alternatively, while I agree with the view that `-Wno-deprecated-declarations` will potentially hide useful warnings, having these warnings is in my opinion less important than being able to build with non-bundled versions of boost and protobuf. > "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable > freshly released) > --- > > Key: MESOS-7699 > URL: https://issues.apache.org/jira/browse/MESOS-7699 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.2.0 >Reporter: Adam Cecile > Labels: autotools > > Hi, > It seems the issue comes from a workaround added a while ago: > https://reviews.apache.org/r/40326/ > https://reviews.apache.org/r/40327/ > When building with external libraries it turns out creating build commands > line with -isystem /usr/include which is clearly stated as being wrong, > according to GCC guys: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70129 > I'll do some testing by reverting all -isystem to -I and I'll let it know if > it gets built. > Regards, Adam. 
> {noformat} > configure:21642: result: no > configure:21642: checking glog/logging.h presence > configure:21642: g++ -E -I/usr/include -I/usr/include/apr-1 > -I/usr/include/apr-1.0 -Wdate-time -D_FORTIFY_SOURCE=2 -isystem /usr/include > -I/usr/include conftest.cpp > In file included from /usr/include/c++/6/ext/string_conversions.h:41:0, > from /usr/include/c++/6/bits/basic_string.h:5417, > from /usr/include/c++/6/string:52, > from /usr/include/c++/6/bits/locale_classes.h:40, > from /usr/include/c++/6/bits/ios_base.h:41, > from /usr/include/c++/6/ios:42, > from /usr/include/c++/6/ostream:38, > from /usr/include/glog/logging.h:43, > from conftest.cpp:32: > /usr/include/c++/6/cstdlib:75:25: fatal error: stdlib.h: No such file or > directory > #include_next > ^ > compilation terminated. > configure:21642: $? = 1 > configure: failed program was: > | /* confdefs.h */ > | #define PACKAGE_NAME "mesos" > | #define PACKAGE_TARNAME "mesos" > | #define PACKAGE_VERSION "1.2.0" > | #define PACKAGE_STRING "mesos 1.2.0" > | #define PACKAGE_BUGREPORT "" > | #define PACKAGE_URL "" > | #define PACKAGE "mesos" > | #define VERSION "1.2.0" > | #define STDC_HEADERS 1 > | #define HAVE_SYS_TYPES_H 1 > | #define HAVE_SYS_STAT_H 1 > | #define HAVE_STDLIB_H 1 > | #define HAVE_STRING_H 1 > | #define HAVE_MEMORY_H 1 > | #define HAVE_STRINGS_H 1 > | #define HAVE_INTTYPES_H 1 > | #define HAVE_STDINT_H 1 > | #define HAVE_UNISTD_H 1 > | #define HAVE_DLFCN_H 1 > | #define LT_OBJDIR ".libs/" > | #define HAVE_CXX11 1 > | #define HAVE_PTHREAD_PRIO_INHERIT 1 > | #define HAVE_PTHREAD 1 > | #define HAVE_LIBZ 1 > | #define HAVE_FTS_H 1 > | #define HAVE_APR_POOLS_H 1 > | #define HAVE_LIBAPR_1 1 > | #define HAVE_BOOST_VERSION_HPP 1 > | #define HAVE_LIBCURL 1 > | /* end confdefs.h. 
*/ > | #include <glog/logging.h> > configure:21642: result: no > configure:21642: checking for glog/logging.h > configure:21642: result: no > configure:21674: error: cannot find glog > --- > You have requested the use of a non-bundled glog but no suitable > glog could be found. > You may want specify the location of glog by providing a prefix > path via --with-glog=DIR, or check that the path you provided is > correct if you're already doing this. > --- > {noformat}
[jira] [Created] (MESOS-7941) Send TASK_STARTING status from built-in executors
Benno Evers created MESOS-7941: -- Summary: Send TASK_STARTING status from built-in executors Key: MESOS-7941 URL: https://issues.apache.org/jira/browse/MESOS-7941 Project: Mesos Issue Type: Bug Reporter: Benno Evers Assignee: Benno Evers All executors have the option to send out a TASK_STARTING status update to signal to the scheduler that they received the command to launch the task. It would be good if our built-in executors did this, for the reasons laid out in https://mail-archives.apache.org/mod_mbox/mesos-dev/201708.mbox/%3CCA%2B9TLTzkEVM0CKvY%2B%3D0%3DwjrN6hYFAt0401Y7b8tysDWx1WZzdw%40mail.gmail.com%3E This will also fix MESOS-6790.
[jira] [Commented] (MESOS-7941) Send TASK_STARTING status from built-in executors
[ https://issues.apache.org/jira/browse/MESOS-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155483#comment-16155483 ] Benno Evers commented on MESOS-7941: Review: https://reviews.apache.org/r/62123/ > Send TASK_STARTING status from built-in executors > - > > Key: MESOS-7941 > URL: https://issues.apache.org/jira/browse/MESOS-7941 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Assignee: Benno Evers > > All executors have the option to send out a TASK_STARTING status update to > signal to the scheduler that they received the command to launch the task. > It would be good if our built-in executors would do this, for reasons laid > out in > https://mail-archives.apache.org/mod_mbox/mesos-dev/201708.mbox/%3CCA%2B9TLTzkEVM0CKvY%2B%3D0%3DwjrN6hYFAt0401Y7b8tysDWx1WZzdw%40mail.gmail.com%3E > This will also fix MESOS-6790. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7941) Send TASK_STARTING status from built-in executors
[ https://issues.apache.org/jira/browse/MESOS-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155487#comment-16155487 ] Benno Evers commented on MESOS-7941: PR to update Chronos to correctly handle these new updates: https://github.com/mesos/chronos/pull/854 > Send TASK_STARTING status from built-in executors > - > > Key: MESOS-7941 > URL: https://issues.apache.org/jira/browse/MESOS-7941 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Assignee: Benno Evers > > All executors have the option to send out a TASK_STARTING status update to > signal to the scheduler that they received the command to launch the task. > It would be good if our built-in executors would do this, for reasons laid > out in > https://mail-archives.apache.org/mod_mbox/mesos-dev/201708.mbox/%3CCA%2B9TLTzkEVM0CKvY%2B%3D0%3DwjrN6hYFAt0401Y7b8tysDWx1WZzdw%40mail.gmail.com%3E > This will also fix MESOS-6790. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
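A framework that assumes TASK_RUNNING is the first update after a launch may need a small adjustment once built-in executors send TASK_STARTING. The following is a minimal sketch of a scheduler-side tracker that tolerates the additional update. The state names are a subset of the Mesos TaskState values; the class `TaskTracker` and its methods are hypothetical and are not taken from Chronos or any other framework.

```python
# Subset of Mesos TaskState names; states outside this subset are
# rejected in this sketch for simplicity.
ACTIVE_STATES = {"TASK_STAGING", "TASK_STARTING", "TASK_RUNNING"}
TERMINAL_STATES = {
    "TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST", "TASK_ERROR",
}


class TaskTracker:
    """Hypothetical bookkeeping of the last known state of each task."""

    def __init__(self):
        self.states = {}  # task_id -> last seen active state

    def launch(self, task_id):
        self.states[task_id] = "TASK_STAGING"

    def status_update(self, task_id, state):
        if state in TERMINAL_STATES:
            # The task is done; forget about it.
            self.states.pop(task_id, None)
        elif state in ACTIVE_STATES:
            # TASK_STARTING is treated like any other non-terminal update.
            self.states[task_id] = state
        else:
            raise ValueError("unhandled task state: " + state)

    def is_active(self, task_id):
        return task_id in self.states
```

The point of the sketch is only that TASK_STARTING is handled as an ordinary non-terminal state rather than as an unexpected one.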
[jira] [Created] (MESOS-7944) Implement jemalloc support for Mesos
Benno Evers created MESOS-7944: -- Summary: Implement jemalloc support for Mesos Key: MESOS-7944 URL: https://issues.apache.org/jira/browse/MESOS-7944 Project: Mesos Issue Type: Bug Reporter: Benno Evers Assignee: Benno Evers After investigation in MESOS-7876 and discussion on the mailing list, this task is for tracking progress on adding out-of-the-box memory profiling support using jemalloc to Mesos. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7944) Implement jemalloc support for Mesos
[ https://issues.apache.org/jira/browse/MESOS-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161001#comment-16161001 ] Benno Evers commented on MESOS-7944: Since I've started to work on this, I now have a much sharper idea of what needs to be done. First of all, since the added features are not mesos-specific, I think it's best to add them directly to libprocess. However, the choice of preferred malloc should be up to the binary, not enforced by a shared library, so instead of compiling against jemalloc we should detect at runtime whether we're running under jemalloc or not (similar to what folly does here: https://github.com/facebook/folly/blob/master/folly/Malloc.h#L150). At the endpoint, the minimum features I would like are the ability to get the (exact) heap allocation statistics as JSON, or to download current (stochastic) heap profile dumps as files. Depending on the complexity of it, we should also think about providing a way to have the master dump profiles periodically and store them on disk, and a way to generate jeprof-graphs automatically. Finally, the new `--enable-memory-profiling` configure option (tentative name) for mesos would build a bundled version of jemalloc with all the necessary configuration options enabled, and link the mesos-master and mesos-slave binaries against this library. > Implement jemalloc support for Mesos > > > Key: MESOS-7944 > URL: https://issues.apache.org/jira/browse/MESOS-7944 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Assignee: Benno Evers > > After investigation in MESOS-7876 and discussion on the mailing list, this > task is for tracking progress on adding out-of-the-box memory profiling > support using jemalloc to Mesos.
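The runtime detection mentioned in the comment above boils down to asking the dynamic linker whether a jemalloc-specific symbol is visible in the running process. The following is a minimal sketch of that idea, written in Python with `ctypes` for brevity; the function name `using_jemalloc` is hypothetical, and a real implementation in libprocess would be C++. It assumes jemalloc was built without a symbol prefix, so that its `mallctl` function is exported under that name.

```python
import ctypes


def using_jemalloc() -> bool:
    """Return True if jemalloc's mallctl symbol is visible in this process."""
    try:
        # On POSIX systems, CDLL(None) opens the running process itself,
        # including all libraries loaded into it (like dlopen(NULL)).
        process = ctypes.CDLL(None)
    except (OSError, TypeError):
        # For example on platforms where the process cannot be opened this way.
        return False
    # Attribute access performs a symbol lookup; a missing symbol raises
    # AttributeError, which hasattr() turns into False.
    return hasattr(process, "mallctl")
```

Because the check happens at runtime, the same binary behaves sensibly whether or not jemalloc is linked in or preloaded. A jemalloc build configured with a symbol prefix, or linked statically without exporting its symbols, would not be detected by this particular lookup.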
[jira] [Comment Edited] (MESOS-7944) Implement jemalloc support for Mesos
[ https://issues.apache.org/jira/browse/MESOS-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161001#comment-16161001 ] Benno Evers edited comment on MESOS-7944 at 9/11/17 9:59 AM: - Since I've started to work on this, I now have a much better idea of what needs to be done. First of all, since the added features are not mesos-specific, I think it's best to add them directly to libprocess. However, the choice of preferred malloc should be up to the binary, not enforced by a shared library, so instead of compiling against jemalloc we should detect at runtime whether we're running under jemalloc or not (similar to what folly does here: https://github.com/facebook/folly/blob/master/folly/Malloc.h#L150). At the endpoint, the minimum features I would like are the ability to get the (exact) heap allocation statistics as JSON, or download current (stochastic) heap profile dumps as files. Depending on the complexity of it, we should also think about providing a way to have the master dump profiles periodically and store them on disk, and a way to generate jeprof-graphs automatically. Finally, the new `--enable-memory-profiling` configure option (tentative name) for mesos would build a bundled version of jemalloc with all the necessary configuration options enabled, and link the mesos-master and mesos-slave binaries against this library. was (Author: bennoe): Since I've started to work on this, I now have a much sharper idea of what needs to be done. First of all, since the added features are not mesos-specific, I think it's best to add them directly to libprocess. However, the choice of preferred malloc should be up to the binary, not enforced by a shared library, so instead of compiling against jemalloc we should detect at runtime whether we're running under jemalloc or not (similar to what folly does here: https://github.com/facebook/folly/blob/master/folly/Malloc.h#L150). At the endpoint, the minimum features I would like are the ability to get the (exact) heap allocation statistics as JSON, or download current (stochastic) heap profile dumps as files. Depending on the complexity of it, we should also think about providing a way to have the master dump profiles periodically and store them on disk, and a way to generate jeprof-graphs automatically. Finally, the new `--enable-memory-profiling` configure option (tentative name) for mesos would build a bundled version of jemalloc with all the necessary configuration options enabled, and link the mesos-master and mesos-slave binaries against this library. > Implement jemalloc support for Mesos > > > Key: MESOS-7944 > URL: https://issues.apache.org/jira/browse/MESOS-7944 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Assignee: Benno Evers > > After investigation in MESOS-7876 and discussion on the mailing list, this > task is for tracking progress on adding out-of-the-box memory profiling > support using jemalloc to Mesos. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (MESOS-7941) Send TASK_STARTING status from built-in executors
[ https://issues.apache.org/jira/browse/MESOS-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155483#comment-16155483 ] Benno Evers edited comment on MESOS-7941 at 9/15/17 8:51 AM: - Review: https://reviews.apache.org/r/62212/ was (Author: bennoe): Review: https://reviews.apache.org/r/62123/ > Send TASK_STARTING status from built-in executors > - > > Key: MESOS-7941 > URL: https://issues.apache.org/jira/browse/MESOS-7941 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Assignee: Benno Evers > > All executors have the option to send out a TASK_STARTING status update to > signal to the scheduler that they received the command to launch the task. > It would be good if our built-in executors would do this, for reasons laid > out in > https://mail-archives.apache.org/mod_mbox/mesos-dev/201708.mbox/%3CCA%2B9TLTzkEVM0CKvY%2B%3D0%3DwjrN6hYFAt0401Y7b8tysDWx1WZzdw%40mail.gmail.com%3E > This will also fix MESOS-6790. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7994) Hard-coded protobuf version in mesos.pom.in
Benno Evers created MESOS-7994: -- Summary: Hard-coded protobuf version in mesos.pom.in Key: MESOS-7994 URL: https://issues.apache.org/jira/browse/MESOS-7994 Project: Mesos Issue Type: Bug Reporter: Benno Evers Currently, the version of protobuf.jar used by maven is hardcoded in `src/java/mesos.pom.in` to be 3.3.0. When building against a non-bundled version of protobuf, this will likely cause a version mismatch which can lead to build errors because the java build is trying to compile the java source files created by the protoc of the non-bundled protobuf. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8005) Mesos.SlaveTest.ShutdownUnregisteredExecutor is flaky
Benno Evers created MESOS-8005: -- Summary: Mesos.SlaveTest.ShutdownUnregisteredExecutor is flaky Key: MESOS-8005 URL: https://issues.apache.org/jira/browse/MESOS-8005 Project: Mesos Issue Type: Bug Reporter: Benno Evers Executed on Ubuntu 17.04 w/ SSL enabled: {code} ../../src/tests/cluster.cpp:580 Value of: containers->empty() Actual: false Expected: true Failed to destroy containers: { 86d690bc-4248-4d26-bdc7-28901d8cf2ab } {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8005) Mesos.SlaveTest.ShutdownUnregisteredExecutor is flaky
[ https://issues.apache.org/jira/browse/MESOS-8005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers updated MESOS-8005: --- Attachment: jenkins.log.gz Full log. > Mesos.SlaveTest.ShutdownUnregisteredExecutor is flaky > - > > Key: MESOS-8005 > URL: https://issues.apache.org/jira/browse/MESOS-8005 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers > Attachments: jenkins.log.gz > > > Executed on Ubuntu 17.04 w/ SSL enabled: > {code} > ../../src/tests/cluster.cpp:580 > Value of: containers->empty() > Actual: false > Expected: true > Failed to destroy containers: { 86d690bc-4248-4d26-bdc7-28901d8cf2ab } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (MESOS-8005) Mesos.SlaveTest.ShutdownUnregisteredExecutor is flaky
[ https://issues.apache.org/jira/browse/MESOS-8005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16176745#comment-16176745 ] Benno Evers edited comment on MESOS-8005 at 9/22/17 5:19 PM: - Sure, I attached the full log. The build was started for commit 548aaee3a8f5935457767db1e3b761d873b09cbf on the master branch, but I highly doubt that this commit caused the test failure. was (Author: bennoe): Full log. > Mesos.SlaveTest.ShutdownUnregisteredExecutor is flaky > - > > Key: MESOS-8005 > URL: https://issues.apache.org/jira/browse/MESOS-8005 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers > Attachments: jenkins.log.gz > > > Executed on Ubuntu 17.04 w/ SSL enabled: > {code} > ../../src/tests/cluster.cpp:580 > Value of: containers->empty() > Actual: false > Expected: true > Failed to destroy containers: { 86d690bc-4248-4d26-bdc7-28901d8cf2ab } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (MESOS-8005) Mesos.SlaveTest.ShutdownUnregisteredExecutor is flaky
[ https://issues.apache.org/jira/browse/MESOS-8005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16176745#comment-16176745 ] Benno Evers edited comment on MESOS-8005 at 9/22/17 5:20 PM: - Sure, I attached the full log. The build was started for commit 548aaee3a8f5935457767db1e3b761d873b09cbf on the master branch, but I highly doubt that this commit caused the test failure: {code} bevers@poincare:~/src/mesos/worktrees/master$ git show 548aaee3a8f5935457767db1e3b761d873b09cbf --stat commit 548aaee3a8f5935457767db1e3b761d873b09cbf Author: Tomasz Janiszewski Date: Thu Sep 21 16:16:06 2017 -0700 Display task state counters in the framework page. Fixes MESOS-7962. This closes #234 src/webui/master/static/framework.html| 42 ++ src/webui/master/static/js/controllers.js | 30 ++ 2 files changed, 72 insertions(+) {code} was (Author: bennoe): Sure, I attached the full log. The build was started for commit 548aaee3a8f5935457767db1e3b761d873b09cbf on the master branch, but I highly doubt that this commit caused the test failure. > Mesos.SlaveTest.ShutdownUnregisteredExecutor is flaky > - > > Key: MESOS-8005 > URL: https://issues.apache.org/jira/browse/MESOS-8005 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers > Attachments: jenkins.log.gz > > > Executed on Ubuntu 17.04 w/ SSL enabled: > {code} > ../../src/tests/cluster.cpp:580 > Value of: containers->empty() > Actual: false > Expected: true > Failed to destroy containers: { 86d690bc-4248-4d26-bdc7-28901d8cf2ab } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8023) Warn users trying to use HTTP Basic Authentication over non-secure channels
Benno Evers created MESOS-8023: -- Summary: Warn users trying to use HTTP Basic Authentication over non-secure channels Key: MESOS-8023 URL: https://issues.apache.org/jira/browse/MESOS-8023 Project: Mesos Issue Type: Improvement Reporter: Benno Evers Since Basic authentication submits usernames and passwords in plain text, it should only be used when the connection is already secured through another layer, e.g. when using HTTPS. Since many users are not aware of this fact, Mesos should try to detect and warn about this situation where possible, to prevent accidental leaking of passwords. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
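As a rough illustration of the kind of check the ticket above asks for, the following Python sketch flags a request that carries Basic credentials over a non-HTTPS URL. It is not Mesos code: the function names, the logger name, and the example URL are hypothetical, and a real implementation inside Mesos would more likely look at whether the connection itself is encrypted than at a URL string.

```python
import logging
from urllib.parse import urlsplit

logger = logging.getLogger("basic_auth_check")  # hypothetical logger name


def is_insecure_basic_auth(url: str, headers: dict) -> bool:
    """Return True if Basic credentials would travel over a non-HTTPS URL."""
    scheme = urlsplit(url).scheme.lower()
    authorization = ""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "authorization":
            authorization = value.strip()
    # The authentication scheme token ("Basic") is case-insensitive as well.
    uses_basic = authorization.lower().startswith("basic ")
    return uses_basic and scheme != "https"


def warn_if_insecure(url: str, headers: dict) -> None:
    """Log a warning instead of rejecting the request (assumed policy)."""
    if is_insecure_basic_auth(url, headers):
        logger.warning("HTTP Basic credentials sent over unencrypted %s", url)


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    warn_if_insecure("http://master.example:5050/state",
                     {"Authorization": "Basic dXNlcjpwYXNz"})
```

Warning rather than failing keeps existing deployments working while still surfacing the risk to the operator.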
[jira] [Created] (MESOS-8047) SubprocessTest.Status does not always receive a signal
Benno Evers created MESOS-8047: -- Summary: SubprocessTest.Status does not always receive a signal Key: MESOS-8047 URL: https://issues.apache.org/jira/browse/MESOS-8047 Project: Mesos Issue Type: Bug Reporter: Benno Evers This one seems to be different from MESOS-1705 and MESOS-1738. It might be that previous test runs leave a mesos process running in the background, but I didn't investigate very deeply: {code} [ RUN ] SubprocessTest.Status /home/bevers/src/mesos/worktrees/master/3rdparty/libprocess/src/tests/subprocess_tests.cpp:281: Failure Expecting WIFSIGNALED(s.get().status()()->get()) but WIFEXITED(s.get().status()()->get()) is true and WEXITSTATUS(s.get().status()()->get()) is 0 {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7941) Send TASK_STARTING status from built-in executors
[ https://issues.apache.org/jira/browse/MESOS-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers updated MESOS-7941: --- Sprint: Mesosphere Sprint 65 > Send TASK_STARTING status from built-in executors > - > > Key: MESOS-7941 > URL: https://issues.apache.org/jira/browse/MESOS-7941 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Assignee: Benno Evers > > All executors have the option to send out a TASK_STARTING status update to > signal to the scheduler that they received the command to launch the task. > It would be good if our built-in executors would do this, for reasons laid > out in > https://mail-archives.apache.org/mod_mbox/mesos-dev/201708.mbox/%3CCA%2B9TLTzkEVM0CKvY%2B%3D0%3DwjrN6hYFAt0401Y7b8tysDWx1WZzdw%40mail.gmail.com%3E > This will also fix MESOS-6790. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7944) Implement jemalloc support for Mesos
[ https://issues.apache.org/jira/browse/MESOS-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers updated MESOS-7944: --- Sprint: Mesosphere Sprint 63, Mesosphere Sprint 65 (was: Mesosphere Sprint 63) > Implement jemalloc support for Mesos > > > Key: MESOS-7944 > URL: https://issues.apache.org/jira/browse/MESOS-7944 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Assignee: Benno Evers > > After investigation in MESOS-7876 and discussion on the mailing list, this > task is for tracking progress on adding out-of-the-box memory profiling > support using jemalloc to Mesos. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-6790) Wrong task started time in webui
[ https://issues.apache.org/jira/browse/MESOS-6790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers updated MESOS-6790: --- Sprint: Mesosphere Sprint 65 > Wrong task started time in webui > > > Key: MESOS-6790 > URL: https://issues.apache.org/jira/browse/MESOS-6790 > Project: Mesos > Issue Type: Bug > Components: webui >Reporter: haosdent >Assignee: Benno Evers > Labels: health-check, mesosphere, observability, webui > > Reported by [~janisz] > {quote} > Hi > When task has enabled Mesos healthcheck start time in UI can show wrong > time. This happens because UI assumes that first status is task started > [0]. This is not always true because Mesos keeps only recent tasks statuses > [1] so when healthcheck updates tasks status it can override task start > time displayed in webui. > Best > Tomek > [0] > https://github.com/apache/mesos/blob/master/src/webui/master/static/js/controllers.js#L140 > [1] > https://github.com/apache/mesos/blob/f2adc8a95afda943f6a10e771aad64300da19047/src/common/protobuf_utils.cpp#L263-L265 > {quote} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (MESOS-6790) Wrong task started time in webui
[ https://issues.apache.org/jira/browse/MESOS-6790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers reassigned MESOS-6790: -- Assignee: Benno Evers (was: Tomasz Janiszewski) > Wrong task started time in webui > > > Key: MESOS-6790 > URL: https://issues.apache.org/jira/browse/MESOS-6790 > Project: Mesos > Issue Type: Bug > Components: webui >Reporter: haosdent >Assignee: Benno Evers > Labels: health-check, mesosphere, observability, webui > > Reported by [~janisz] > {quote} > Hi > When task has enabled Mesos healthcheck start time in UI can show wrong > time. This happens because UI assumes that first status is task started > [0]. This is not always true because Mesos keeps only recent tasks statuses > [1] so when healthcheck updates tasks status it can override task start > time displayed in webui. > Best > Tomek > [0] > https://github.com/apache/mesos/blob/master/src/webui/master/static/js/controllers.js#L140 > [1] > https://github.com/apache/mesos/blob/f2adc8a95afda943f6a10e771aad64300da19047/src/common/protobuf_utils.cpp#L263-L265 > {quote} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (MESOS-7699) "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable freshly released)
[ https://issues.apache.org/jira/browse/MESOS-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers reassigned MESOS-7699: -- Assignee: Benno Evers > "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable > freshly released) > --- > > Key: MESOS-7699 > URL: https://issues.apache.org/jira/browse/MESOS-7699 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.2.0 >Reporter: Adam Cecile >Assignee: Benno Evers > Labels: autotools > > Hi, > It seems the issue comes from a workaround added a while ago: > https://reviews.apache.org/r/40326/ > https://reviews.apache.org/r/40327/ > When building with external libraries it turns out creating build commands > line with -isystem /usr/include which is clearly stated as being wrong, > according to GCC guys: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70129 > I'll do some testing by reverting all -isystem to -I and I'll let it know if > it gets built. > Regards, Adam. > {noformat} > configure:21642: result: no > configure:21642: checking glog/logging.h presence > configure:21642: g++ -E -I/usr/include -I/usr/include/apr-1 > -I/usr/include/apr-1.0 -Wdate-time -D_FORTIFY_SOURCE=2 -isystem /usr/include > -I/usr/include conftest.cpp > In file included from /usr/include/c++/6/ext/string_conversions.h:41:0, > from /usr/include/c++/6/bits/basic_string.h:5417, > from /usr/include/c++/6/string:52, > from /usr/include/c++/6/bits/locale_classes.h:40, > from /usr/include/c++/6/bits/ios_base.h:41, > from /usr/include/c++/6/ios:42, > from /usr/include/c++/6/ostream:38, > from /usr/include/glog/logging.h:43, > from conftest.cpp:32: > /usr/include/c++/6/cstdlib:75:25: fatal error: stdlib.h: No such file or > directory > #include_next > ^ > compilation terminated. > configure:21642: $? 
= 1 > configure: failed program was: > | /* confdefs.h */ > | #define PACKAGE_NAME "mesos" > | #define PACKAGE_TARNAME "mesos" > | #define PACKAGE_VERSION "1.2.0" > | #define PACKAGE_STRING "mesos 1.2.0" > | #define PACKAGE_BUGREPORT "" > | #define PACKAGE_URL "" > | #define PACKAGE "mesos" > | #define VERSION "1.2.0" > | #define STDC_HEADERS 1 > | #define HAVE_SYS_TYPES_H 1 > | #define HAVE_SYS_STAT_H 1 > | #define HAVE_STDLIB_H 1 > | #define HAVE_STRING_H 1 > | #define HAVE_MEMORY_H 1 > | #define HAVE_STRINGS_H 1 > | #define HAVE_INTTYPES_H 1 > | #define HAVE_STDINT_H 1 > | #define HAVE_UNISTD_H 1 > | #define HAVE_DLFCN_H 1 > | #define LT_OBJDIR ".libs/" > | #define HAVE_CXX11 1 > | #define HAVE_PTHREAD_PRIO_INHERIT 1 > | #define HAVE_PTHREAD 1 > | #define HAVE_LIBZ 1 > | #define HAVE_FTS_H 1 > | #define HAVE_APR_POOLS_H 1 > | #define HAVE_LIBAPR_1 1 > | #define HAVE_BOOST_VERSION_HPP 1 > | #define HAVE_LIBCURL 1 > | /* end confdefs.h. */ > | #include > configure:21642: result: no > configure:21642: checking for glog/logging.h > configure:21642: result: no > configure:21674: error: cannot find glog > --- > You have requested the use of a non-bundled glog but no suitable > glog could be found. > You may want specify the location of glog by providing a prefix > path via --with-glog=DIR, or check that the path you provided is > correct if you're already doing this. > --- > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7699) "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable freshly released)
[ https://issues.apache.org/jira/browse/MESOS-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202283#comment-16202283 ] Benno Evers commented on MESOS-7699: I posted a review chain to fix this (along with follow-up issues when building against unbundled versions of boost and protobuf) at https://reviews.apache.org/r/62160/ > "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable > freshly released) > --- > > Key: MESOS-7699 > URL: https://issues.apache.org/jira/browse/MESOS-7699 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.2.0 >Reporter: Adam Cecile >Assignee: Benno Evers > Labels: autotools > > Hi, > It seems the issue comes from a workaround added a while ago: > https://reviews.apache.org/r/40326/ > https://reviews.apache.org/r/40327/ > When building with external libraries it turns out creating build commands > line with -isystem /usr/include which is clearly stated as being wrong, > according to GCC guys: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70129 > I'll do some testing by reverting all -isystem to -I and I'll let it know if > it gets built. > Regards, Adam. > {noformat} > configure:21642: result: no > configure:21642: checking glog/logging.h presence > configure:21642: g++ -E -I/usr/include -I/usr/include/apr-1 > -I/usr/include/apr-1.0 -Wdate-time -D_FORTIFY_SOURCE=2 -isystem /usr/include > -I/usr/include conftest.cpp > In file included from /usr/include/c++/6/ext/string_conversions.h:41:0, > from /usr/include/c++/6/bits/basic_string.h:5417, > from /usr/include/c++/6/string:52, > from /usr/include/c++/6/bits/locale_classes.h:40, > from /usr/include/c++/6/bits/ios_base.h:41, > from /usr/include/c++/6/ios:42, > from /usr/include/c++/6/ostream:38, > from /usr/include/glog/logging.h:43, > from conftest.cpp:32: > /usr/include/c++/6/cstdlib:75:25: fatal error: stdlib.h: No such file or > directory > #include_next > ^ > compilation terminated. 
> configure:21642: $? = 1 > configure: failed program was: > | /* confdefs.h */ > | #define PACKAGE_NAME "mesos" > | #define PACKAGE_TARNAME "mesos" > | #define PACKAGE_VERSION "1.2.0" > | #define PACKAGE_STRING "mesos 1.2.0" > | #define PACKAGE_BUGREPORT "" > | #define PACKAGE_URL "" > | #define PACKAGE "mesos" > | #define VERSION "1.2.0" > | #define STDC_HEADERS 1 > | #define HAVE_SYS_TYPES_H 1 > | #define HAVE_SYS_STAT_H 1 > | #define HAVE_STDLIB_H 1 > | #define HAVE_STRING_H 1 > | #define HAVE_MEMORY_H 1 > | #define HAVE_STRINGS_H 1 > | #define HAVE_INTTYPES_H 1 > | #define HAVE_STDINT_H 1 > | #define HAVE_UNISTD_H 1 > | #define HAVE_DLFCN_H 1 > | #define LT_OBJDIR ".libs/" > | #define HAVE_CXX11 1 > | #define HAVE_PTHREAD_PRIO_INHERIT 1 > | #define HAVE_PTHREAD 1 > | #define HAVE_LIBZ 1 > | #define HAVE_FTS_H 1 > | #define HAVE_APR_POOLS_H 1 > | #define HAVE_LIBAPR_1 1 > | #define HAVE_BOOST_VERSION_HPP 1 > | #define HAVE_LIBCURL 1 > | /* end confdefs.h. */ > | #include > configure:21642: result: no > configure:21642: checking for glog/logging.h > configure:21642: result: no > configure:21674: error: cannot find glog > --- > You have requested the use of a non-bundled glog but no suitable > glog could be found. > You may want specify the location of glog by providing a prefix > path via --with-glog=DIR, or check that the path you provided is > correct if you're already doing this. > --- > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7699) "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable freshly released)
[ https://issues.apache.org/jira/browse/MESOS-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers updated MESOS-7699: --- Shepherd: Benjamin Bannier Sprint: Mesosphere Sprint 66 Story Points: 3 > "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable > freshly released) > --- > > Key: MESOS-7699 > URL: https://issues.apache.org/jira/browse/MESOS-7699 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.2.0 >Reporter: Adam Cecile >Assignee: Benno Evers > Labels: autotools > > Hi, > It seems the issue comes from a workaround added a while ago: > https://reviews.apache.org/r/40326/ > https://reviews.apache.org/r/40327/ > When building with external libraries it turns out creating build commands > line with -isystem /usr/include which is clearly stated as being wrong, > according to GCC guys: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70129 > I'll do some testing by reverting all -isystem to -I and I'll let it know if > it gets built. > Regards, Adam. > {noformat} > configure:21642: result: no > configure:21642: checking glog/logging.h presence > configure:21642: g++ -E -I/usr/include -I/usr/include/apr-1 > -I/usr/include/apr-1.0 -Wdate-time -D_FORTIFY_SOURCE=2 -isystem /usr/include > -I/usr/include conftest.cpp > In file included from /usr/include/c++/6/ext/string_conversions.h:41:0, > from /usr/include/c++/6/bits/basic_string.h:5417, > from /usr/include/c++/6/string:52, > from /usr/include/c++/6/bits/locale_classes.h:40, > from /usr/include/c++/6/bits/ios_base.h:41, > from /usr/include/c++/6/ios:42, > from /usr/include/c++/6/ostream:38, > from /usr/include/glog/logging.h:43, > from conftest.cpp:32: > /usr/include/c++/6/cstdlib:75:25: fatal error: stdlib.h: No such file or > directory > #include_next > ^ > compilation terminated. > configure:21642: $? 
= 1 > configure: failed program was: > | /* confdefs.h */ > | #define PACKAGE_NAME "mesos" > | #define PACKAGE_TARNAME "mesos" > | #define PACKAGE_VERSION "1.2.0" > | #define PACKAGE_STRING "mesos 1.2.0" > | #define PACKAGE_BUGREPORT "" > | #define PACKAGE_URL "" > | #define PACKAGE "mesos" > | #define VERSION "1.2.0" > | #define STDC_HEADERS 1 > | #define HAVE_SYS_TYPES_H 1 > | #define HAVE_SYS_STAT_H 1 > | #define HAVE_STDLIB_H 1 > | #define HAVE_STRING_H 1 > | #define HAVE_MEMORY_H 1 > | #define HAVE_STRINGS_H 1 > | #define HAVE_INTTYPES_H 1 > | #define HAVE_STDINT_H 1 > | #define HAVE_UNISTD_H 1 > | #define HAVE_DLFCN_H 1 > | #define LT_OBJDIR ".libs/" > | #define HAVE_CXX11 1 > | #define HAVE_PTHREAD_PRIO_INHERIT 1 > | #define HAVE_PTHREAD 1 > | #define HAVE_LIBZ 1 > | #define HAVE_FTS_H 1 > | #define HAVE_APR_POOLS_H 1 > | #define HAVE_LIBAPR_1 1 > | #define HAVE_BOOST_VERSION_HPP 1 > | #define HAVE_LIBCURL 1 > | /* end confdefs.h. */ > | #include > configure:21642: result: no > configure:21642: checking for glog/logging.h > configure:21642: result: no > configure:21674: error: cannot find glog > --- > You have requested the use of a non-bundled glog but no suitable > glog could be found. > You may want specify the location of glog by providing a prefix > path via --with-glog=DIR, or check that the path you provided is > correct if you're already doing this. > --- > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8217) Don't run linters on every commit
Benno Evers created MESOS-8217: -- Summary: Don't run linters on every commit Key: MESOS-8217 URL: https://issues.apache.org/jira/browse/MESOS-8217 Project: Mesos Issue Type: Bug Reporter: Benno Evers The mesos `pre-commit` hook is currently running several linters on the source code, some of which are even dynamically installed from the internet during a commit. This can hinder development because it also applies to local commits that are never intended to be published, and can quickly become annoying when rebasing old branches. Instead, we should think about putting these hooks into a separate `support/verify-reviews.py` which would be executed when trying to post a review, since at this point the patches should be cleaned up and pass all linter checks. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8273) Incorrect master state due to fast agent re-registration
Benno Evers created MESOS-8273: -- Summary: Incorrect master state due to fast agent re-registration Key: MESOS-8273 URL: https://issues.apache.org/jira/browse/MESOS-8273 Project: Mesos Issue Type: Bug Reporter: Benno Evers Currently, when a mesos agent attempts to reregister while a previous reregistration attempt is still ongoing, the new attempt is discarded and the old one is allowed to continue. This can lead to an inconsistent master state when the agent gained new capabilities or a new version between restarts, since these are only present in the newer reregistration message. Ideally, we should abort the old reregistration attempt and let the new one continue, but this requires some restructuring of the agent reregistration codepath. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
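The difference between the current behaviour and the proposed one can be shown with a toy model. This Python sketch is not the master's actual code: the class `ReregistrationTracker`, its `supersede` switch, and the message dictionaries are all hypothetical, and the real codepath is asynchronous rather than a pair of synchronous calls.

```python
class ReregistrationTracker:
    """Toy model of bookkeeping for in-flight agent re-registrations."""

    def __init__(self, supersede: bool):
        # supersede=False models the behaviour described in the ticket,
        # supersede=True models the proposed behaviour.
        self.supersede = supersede
        self.in_flight = {}  # agent id -> message currently being processed

    def receive(self, agent_id: str, message: dict) -> bool:
        """Return True if `message` becomes the attempt being processed."""
        if agent_id in self.in_flight and not self.supersede:
            # The newer attempt is dropped; the older one keeps running.
            return False
        # Either nothing is pending, or the newer attempt replaces the older.
        self.in_flight[agent_id] = message
        return True

    def complete(self, agent_id: str) -> dict:
        """Finish the pending attempt and return the message that was applied."""
        return self.in_flight.pop(agent_id)


if __name__ == "__main__":
    old, new = {"version": "1.4.0"}, {"version": "1.5.0"}
    for supersede in (False, True):
        tracker = ReregistrationTracker(supersede)
        tracker.receive("agent-1", old)
        tracker.receive("agent-1", new)
        print(supersede, tracker.complete("agent-1"))
```

With `supersede=False` the master ends up applying the stale message, which is the inconsistency the ticket describes; with `supersede=True` the newer message wins.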
[jira] [Updated] (MESOS-8303) Add user doc for agent reconfiguration
[ https://issues.apache.org/jira/browse/MESOS-8303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers updated MESOS-8303: --- Sprint: Mesosphere Sprint 70 > Add user doc for agent reconfiguration > -- > > Key: MESOS-8303 > URL: https://issues.apache.org/jira/browse/MESOS-8303 > Project: Mesos > Issue Type: Documentation >Reporter: Vinod Kone >Assignee: Benno Evers > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8291) Add documentation about fault domains
[ https://issues.apache.org/jira/browse/MESOS-8291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers updated MESOS-8291: --- Sprint: Mesosphere Sprint 70 > Add documentation about fault domains > - > > Key: MESOS-8291 > URL: https://issues.apache.org/jira/browse/MESOS-8291 > Project: Mesos > Issue Type: Documentation >Reporter: Vinod Kone >Assignee: Benno Evers > > We need some user docs for fault domains. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8245) SlaveRecoveryTest/0.ReconnectExecutor is flaky.
[ https://issues.apache.org/jira/browse/MESOS-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers updated MESOS-8245: --- Sprint: Mesosphere Sprint 70 Story Points: 3 > SlaveRecoveryTest/0.ReconnectExecutor is flaky. > --- > > Key: MESOS-8245 > URL: https://issues.apache.org/jira/browse/MESOS-8245 > Project: Mesos > Issue Type: Bug > Components: test > Environment: Ubuntu 17.04 >Reporter: Alexander Rukletsov >Assignee: Benno Evers > Labels: flaky-test > Attachments: ReconnectExecutor-badrun.txt, > ReconnectExecutor-goodrun.txt > > > Observed it today in our CI. Logs attached. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-8115) Add a master flag to disallow agents that are not configured with fault domain
[ https://issues.apache.org/jira/browse/MESOS-8115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290982#comment-16290982 ] Benno Evers commented on MESOS-8115: Review: https://reviews.apache.org/r/64507/ > Add a master flag to disallow agents that are not configured with fault domain > -- > > Key: MESOS-8115 > URL: https://issues.apache.org/jira/browse/MESOS-8115 > Project: Mesos > Issue Type: Improvement >Reporter: Vinod Kone >Assignee: Benno Evers > > Once mesos masters and agents in a cluster are *all* upgraded to a version > where the fault domains feature is available, it is beneficial to enforce > that agents without a fault domain configured are not allowed to join the > cluster. > This is a safety net for operators who could forget to configure the fault > domain of a remote agent and let it join the cluster. If this happens, an > agent in a remote region will be considered a local agent by the master and > frameworks (because agent's fault domain is not configured) causing tasks to > potentially land in a remote agent which is undesirable. > Note that this has to be a configurable flag and not enforced by default > because otherwise upgrades from a fault domain non-configured cluster to a > configured cluster will not be possible. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8336) MasterTest.RegistryUpdateAfterReconfiguration is flaky
Benno Evers created MESOS-8336: -- Summary: MasterTest.RegistryUpdateAfterReconfiguration is flaky Key: MESOS-8336 URL: https://issues.apache.org/jira/browse/MESOS-8336 Project: Mesos Issue Type: Bug Reporter: Benno Evers Observed here: https://jenkins.mesosphere.com/service/jenkins/job/mesos/job/Mesos_CI-build/2399/FLAG=CMake,label=mesos-ec2-debian-8/testReport/junit/mesos-ec2-debian-8-CMake.Mesos/MasterTest/RegistryUpdateAfterReconfiguration/ The test here failed because the registry contained 2 slaves, when it should have only one. Looking through the log, everything seems normal (in particular, only 1 slave id appears throughout this test). The only thing out of the ordinary seems to be the agent sending two `RegisterSlaveMessage`s and two `ReregisterSlaveMessage`s, but looking at the code for generating the random backoff factor in the slave that seems to be more or less normal, and shouldn't break the test. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8341) Agent can become stuck in (re-)registering state during upgrades
Benno Evers created MESOS-8341: -- Summary: Agent can become stuck in (re-)registering state during upgrades Key: MESOS-8341 URL: https://issues.apache.org/jira/browse/MESOS-8341 Project: Mesos Issue Type: Bug Reporter: Benno Evers Currently, an agent will not be erased from the set of currently (re-)registering agents if
- it tries to (re-)register with a malformed version string
- it tries to (re-)register with a version smaller than the minimum supported version
- it tries to (re-)register with a domain when the master has no domain configured
- the operator marks the slave as gone while the (re-)registration is ongoing
Afterwards, all further (re-)registration attempts with the same agent id will be discarded, because the master still thinks that the original (re-)registration is ongoing. Since the most realistic way to encounter this issue would be during cluster upgrades, and it will fix itself with a master restart, it is unlikely to be reported externally. Review: https://reviews.apache.org/r/64506 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
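The failure mode in the ticket above is an early return that skips cleanup of the "currently (re-)registering" set. The following Python sketch shows one way to make the cleanup unconditional. It is a hypothetical, single-threaded model and not the fix posted in the review: the class `Master`, its fields, and the version parsing are invented for illustration, and only two of the four rejection cases are modelled.

```python
class Master:
    """Toy master that tracks agents with an in-progress (re-)registration."""

    def __init__(self, minimum_version=(1, 0, 0)):
        self.minimum_version = minimum_version
        self.reregistering = set()  # agent ids with an attempt in progress
        self.registered = {}        # agent id -> parsed version

    def reregister(self, agent_id: str, version: str) -> bool:
        if agent_id in self.reregistering:
            return False  # an earlier attempt is still considered ongoing
        self.reregistering.add(agent_id)
        try:
            # int() raises ValueError for a malformed version string.
            parsed = tuple(int(part) for part in version.split("."))
            if parsed < self.minimum_version:
                return False  # version below the supported minimum
            self.registered[agent_id] = parsed
            return True
        except ValueError:
            return False
        finally:
            # Runs on every exit path, including the early returns above, so
            # a rejected agent is never left behind in the in-progress set.
            self.reregistering.discard(agent_id)


if __name__ == "__main__":
    master = Master()
    print(master.reregister("agent-1", "not-a-version"))  # rejected
    print(master.reregister("agent-1", "1.5.0"))          # later retry works
```

Putting the cleanup in a single place means that a newly added rejection path cannot reintroduce the stuck state by forgetting to erase the agent.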
[jira] [Commented] (MESOS-8391) Mesos agent doesn't notice that a pod task exits or crashes after the agent restart
[ https://issues.apache.org/jira/browse/MESOS-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311352#comment-16311352 ] Benno Evers commented on MESOS-8391:
I could confirm this behaviour with Mesos 1.5 on a DC/OS 1.11 cluster. For case (2), while the system state eventually returns to normal and Marathon correctly re-schedules the two tasks, the original task seems to stay in the `TASK_KILLING` state indefinitely. From a quick look at the logs, the agent gets as far as "Checkpointing termination state to nested container's runtime directory", but never attempts to destroy the parent container afterwards. I'm currently looking at the container destruction code path to see what the expected behaviour would be.
> Mesos agent doesn't notice that a pod task exits or crashes after the agent > restart > --- > > Key: MESOS-8391 > URL: https://issues.apache.org/jira/browse/MESOS-8391 > Project: Mesos > Issue Type: Bug > Components: agent, containerization, executor >Affects Versions: 1.5.0 >Reporter: Ivan Chernetsky >Priority: Critical > > h4. (1) Agent doesn't detect that a pod task exits/crashes > # Create a Marathon pod with two containers which just do {{sleep 1}}. > # Restart the Mesos agent on the node the pod got launched. > # Kill one of the pod tasks > *Expected result*: The Mesos agent detects that one of the tasks got killed, > and forwards {{TASK_FAILED}} status to Marathon. > *Actual result*: The Mesos agent does nothing, and the Mesos master thinks > that both tasks are running just fine. Marathon doesn't take any action > because it doesn't receive any update from Mesos. > h4. (2) After the agent restart, it detects that the task crashed, forwards > the correct status update, but the other task stays in {{TASK_KILLING}} state > forever > # Perform steps in (1). 
> # Restart the Mesos agent > *Expected result*: The Mesos agent detects that one of the tasks got crashed, > forwards the corresponding status update, and kills the other task too. > *Actual result*: The Mesos agent detects that one of the tasks got crashed, > forwards the corresponding status update, but the other task stays in > `TASK_KILLING` state forever. > Please note, that after another agent restart, the other tasks gets finally > killed and the correct status updates get propagated all the way to Marathon. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-8391) Mesos agent doesn't notice that a pod task exits or crashes after the agent restart
[ https://issues.apache.org/jira/browse/MESOS-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311619#comment-16311619 ] Benno Evers commented on MESOS-8391: Attached a log where Jan 04 11:30:06 <- Started two sleep tasks Jan 04 11:33:14 <- Agent restart Jan 04 11:33:53 <- Killed one of the tasks Jan 04 11:35:08 <- Second agent restart > Mesos agent doesn't notice that a pod task exits or crashes after the agent > restart > --- > > Key: MESOS-8391 > URL: https://issues.apache.org/jira/browse/MESOS-8391 > Project: Mesos > Issue Type: Bug > Components: agent, containerization, executor >Affects Versions: 1.5.0 >Reporter: Ivan Chernetsky >Priority: Critical > > h4. (1) Agent doesn't detect that a pod task exits/crashes > # Create a Marathon pod with two containers which just do {{sleep 1}}. > # Restart the Mesos agent on the node the pod got launched. > # Kill one of the pod tasks > *Expected result*: The Mesos agent detects that one of the tasks got killed, > and forwards {{TASK_FAILED}} status to Marathon. > *Actual result*: The Mesos agent does nothing, and the Mesos master thinks > that both tasks are running just fine. Marathon doesn't take any action > because it doesn't receive any update from Mesos. > h4. (2) After the agent restart, it detects that the task crashed, forwards > the correct status update, but the other task stays in {{TASK_KILLING}} state > forever > # Perform steps in (1). > # Restart the Mesos agent > *Expected result*: The Mesos agent detects that one of the tasks got crashed, > forwards the corresponding status update, and kills the other task too. > *Actual result*: The Mesos agent detects that one of the tasks got crashed, > forwards the corresponding status update, but the other task stays in > `TASK_KILLING` state forever. > Please note, that after another agent restart, the other tasks gets finally > killed and the correct status updates get propagated all the way to Marathon. 
[jira] [Updated] (MESOS-8391) Mesos agent doesn't notice that a pod task exits or crashes after the agent restart
[ https://issues.apache.org/jira/browse/MESOS-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers updated MESOS-8391: --- Attachment: agent.log.gz > Mesos agent doesn't notice that a pod task exits or crashes after the agent > restart > --- > > Key: MESOS-8391 > URL: https://issues.apache.org/jira/browse/MESOS-8391 > Project: Mesos > Issue Type: Bug > Components: agent, containerization, executor >Affects Versions: 1.5.0 >Reporter: Ivan Chernetsky >Priority: Critical > Attachments: agent.log.gz > > > h4. (1) Agent doesn't detect that a pod task exits/crashes > # Create a Marathon pod with two containers which just do {{sleep 1}}. > # Restart the Mesos agent on the node the pod got launched. > # Kill one of the pod tasks > *Expected result*: The Mesos agent detects that one of the tasks got killed, > and forwards {{TASK_FAILED}} status to Marathon. > *Actual result*: The Mesos agent does nothing, and the Mesos master thinks > that both tasks are running just fine. Marathon doesn't take any action > because it doesn't receive any update from Mesos. > h4. (2) After the agent restart, it detects that the task crashed, forwards > the correct status update, but the other task stays in {{TASK_KILLING}} state > forever > # Perform steps in (1). > # Restart the Mesos agent > *Expected result*: The Mesos agent detects that one of the tasks got crashed, > forwards the corresponding status update, and kills the other task too. > *Actual result*: The Mesos agent detects that one of the tasks got crashed, > forwards the corresponding status update, but the other task stays in > `TASK_KILLING` state forever. > Please note, that after another agent restart, the other tasks gets finally > killed and the correct status updates get propagated all the way to Marathon. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-8359) Health checks are flapping for all tasks on the slave if one task has no enough resources to run
[ https://issues.apache.org/jira/browse/MESOS-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316519#comment-16316519 ] Benno Evers commented on MESOS-8359:
From what I gather, the following conditions need to be met to reproduce:
- The other tasks on the slave need to be health-checked by a `COMMAND`-type health check
- The Docker executor must be used for all launched tasks
I'm also wondering which command was actually used for the command health check, and whether the executor and/or master logs at the time the bug is observed show anything interesting. Finally, since I'm not very experienced with Marathon, can you give some more details on what exactly it means to "create a marathon application from your image"?
> Health checks are flapping for all tasks on the slave if one task has no > enough resources to run > > > Key: MESOS-8359 > URL: https://issues.apache.org/jira/browse/MESOS-8359 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.3.2 >Reporter: Viacheslav Valyavskiy > Attachments: logs2 > > > I have attached some logs from the affected > slave(newappmv_qagame_testapp.green_csahttp - name of the 'bad' application) > Steps to reproduce: > 1. Run multiple tasks on the slave > 2. Create marathon application from our image ( docker pull > vvalyavskiy/csa-http ) and set memory limit to 16MB for it. > 3. Wait some time and then observe flapping of all tasks on the slave where > our task is started
[jira] [Commented] (MESOS-8410) Reconfiguration policy fails to handle mount disk resources.
[ https://issues.apache.org/jira/browse/MESOS-8410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320849#comment-16320849 ] Benno Evers commented on MESOS-8410: The issue was caused by an incorrect handling of multiple resources with the same name. I've opened a review with a fix at https://reviews.apache.org/r/65074/ > Reconfiguration policy fails to handle mount disk resources. > > > Key: MESOS-8410 > URL: https://issues.apache.org/jira/browse/MESOS-8410 > Project: Mesos > Issue Type: Bug >Reporter: James Peach >Assignee: Benno Evers > > We deployed {{--reconfiguration_policy="additive"}} on a number of Mesos > agents that had mount disk resources configured, and it looks like the agent > confused the size of the mount disk with the size of the work directory > resource: > {noformat} > E0106 01:54:15.000123 1310889 slave.cpp:6733] EXIT with status 1: Failed to > perform recovery: Configuration change not permitted under 'additive' policy: > Value of scalar resource 'disk' decreased from 183 to 868000 > {noformat} > The {{--resources}} flag is > {noformat} > --resources="[ > { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 868000 > } > } > , > { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 183 > }, > "disk": { > "source": { > "type": "MOUNT", > "mount": { > "root" : "/srv/mesos/volumes/a" > } > } > } > } > , > { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 183 > }, > "disk": { > "source": { > "type": "MOUNT", > "mount": { > "root" : "/srv/mesos/volumes/b" > } > } > } > } > , > { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 183 > }, > "disk": { > "source": { > "type": "MOUNT", > "mount": { > "root" : "/srv/mesos/volumes/c" > } > } > } > } > , > { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 183 > }, > "disk": { > "source": { > "type": "MOUNT", > "mount": { > "root" : "/srv/mesos/volumes/d" > } > } > } > } > , > { > "name": "disk", > 
"type": "SCALAR", > "scalar": { > "value": 183 > }, > "disk": { > "source": { > "type": "MOUNT", > "mount": { > "root" : "/srv/mesos/volumes/e" > } > } > } > } > , > { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 183 > }, > "disk": { > "source": { > "type": "MOUNT", > "mount": { > "root" : "/srv/mesos/volumes/f" > } > } > } > } > , > { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 183 > }, > "disk": { > "source": { > "type": "MOUNT", > "mount": { > "root" : "/srv/mesos/volumes/g" > } > } > } > } > , > { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 183 > }, > "disk": { > "source": { > "type": "MOUNT", > "mount": { > "root" : "/srv/mesos/volumes/h" > } > } > } > } > ] > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8410) Reconfiguration policy fails to handle mount disk resources.
[ https://issues.apache.org/jira/browse/MESOS-8410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers updated MESOS-8410: --- Priority: Blocker (was: Major) > Reconfiguration policy fails to handle mount disk resources. > > > Key: MESOS-8410 > URL: https://issues.apache.org/jira/browse/MESOS-8410 > Project: Mesos > Issue Type: Bug >Reporter: James Peach >Assignee: Benno Evers >Priority: Blocker > > We deployed {{--reconfiguration_policy="additive"}} on a number of Mesos > agents that had mount disk resources configured, and it looks like the agent > confused the size of the mount disk with the size of the work directory > resource: > {noformat} > E0106 01:54:15.000123 1310889 slave.cpp:6733] EXIT with status 1: Failed to > perform recovery: Configuration change not permitted under 'additive' policy: > Value of scalar resource 'disk' decreased from 183 to 868000 > {noformat} > The {{--resources}} flag is > {noformat} > --resources="[ > { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 868000 > } > } > , > { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 183 > }, > "disk": { > "source": { > "type": "MOUNT", > "mount": { > "root" : "/srv/mesos/volumes/a" > } > } > } > } > , > { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 183 > }, > "disk": { > "source": { > "type": "MOUNT", > "mount": { > "root" : "/srv/mesos/volumes/b" > } > } > } > } > , > { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 183 > }, > "disk": { > "source": { > "type": "MOUNT", > "mount": { > "root" : "/srv/mesos/volumes/c" > } > } > } > } > , > { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 183 > }, > "disk": { > "source": { > "type": "MOUNT", > "mount": { > "root" : "/srv/mesos/volumes/d" > } > } > } > } > , > { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 183 > }, > "disk": { > "source": { > "type": "MOUNT", > "mount": { > "root" : "/srv/mesos/volumes/e" > } > } > } > } > 
, > { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 183 > }, > "disk": { > "source": { > "type": "MOUNT", > "mount": { > "root" : "/srv/mesos/volumes/f" > } > } > } > } > , > { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 183 > }, > "disk": { > "source": { > "type": "MOUNT", > "mount": { > "root" : "/srv/mesos/volumes/g" > } > } > } > } > , > { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 183 > }, > "disk": { > "source": { > "type": "MOUNT", > "mount": { > "root" : "/srv/mesos/volumes/h" > } > } > } > } > ] > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
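The comment on MESOS-8410 attributes the failure to incorrect handling of multiple resources with the same name. A minimal sketch of one way to avoid that confusion follows, assuming resources are keyed by name plus mount root instead of by name alone. The types `DiskResource` and `Key` and both functions are hypothetical and are not taken from the fix in the linked review.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified resource: `mountRoot` is empty for the work
// directory disk and set for MOUNT disks.
struct DiskResource
{
  std::string name;
  std::string mountRoot;
  double value;
};

// Key on (name, mount root) so that several "disk" entries stay distinct.
using Key = std::pair<std::string, std::string>;

std::map<Key, double> indexByNameAndSource(
    const std::vector<DiskResource>& resources)
{
  std::map<Key, double> index;
  for (const DiskResource& r : resources) {
    index[Key(r.name, r.mountRoot)] += r.value;
  }
  return index;
}

// Returns true if no previously known resource disappeared or shrank,
// which is the spirit of the 'additive' policy check in the error above.
bool isAdditiveChange(
    const std::vector<DiskResource>& before,
    const std::vector<DiskResource>& after)
{
  const std::map<Key, double> oldIndex = indexByNameAndSource(before);
  const std::map<Key, double> newIndex = indexByNameAndSource(after);

  for (const auto& entry : oldIndex) {
    const auto it = newIndex.find(entry.first);
    if (it == newIndex.end() || it->second < entry.second) {
      return false;
    }
  }
  return true;
}
```

Keyed this way, the 868000 work directory disk and the 183 mount disks from the `--resources` flag above are never compared against each other.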
[jira] [Updated] (MESOS-7944) Implement jemalloc support for Mesos
[ https://issues.apache.org/jira/browse/MESOS-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers updated MESOS-7944: --- Sprint: Mesosphere Sprint 63, Mesosphere Sprint 65, Mesosphere Sprint 66, Mesosphere Sprint 67, Mesosphere Sprint 68, Mesosphere Sprint 72 (was: Mesosphere Sprint 63, Mesosphere Sprint 65, Mesosphere Sprint 66, Mesosphere Sprint 67, Mesosphere Sprint 68) > Implement jemalloc support for Mesos > > > Key: MESOS-7944 > URL: https://issues.apache.org/jira/browse/MESOS-7944 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Assignee: Benno Evers > Labels: mesosphere > > After investigation in MESOS-7876 and discussion on the mailing list, this > task is for tracking progress on adding out-of-the-box memory profiling > support using jemalloc to Mesos. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-6238) SSL / libevent support broken in IPv6 patch from https://github.com/lava/mesos/tree/bennoe/ipv6
[ https://issues.apache.org/jira/browse/MESOS-6238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15569370#comment-15569370 ] Benno Evers commented on MESOS-6238:
Hm, the `url` seems pretty random. I can't remember putting it there for a specific reason, so I guess it's some merge artifact from a previous revision. I pushed a new commit to GitHub (d2d122ab057c93e9136577db5030f9976eb623c3) which fixes this issue; at least for me, mesos now builds with --enable-ssl on Ubuntu trusty and xenial.
> SSL / libevent support broken in IPv6 patch from > https://github.com/lava/mesos/tree/bennoe/ipv6 > --- > > Key: MESOS-6238 > URL: https://issues.apache.org/jira/browse/MESOS-6238 > Project: Mesos > Issue Type: Bug >Reporter: Lukas Loesche >Assignee: Benno Evers > > Affects https://github.com/lava/mesos/tree/bennoe/ipv6 at commit > 2199a24c0b7a782a0381aad8cceacbc95ec3d5c9 > make fails when configure options --enable-ssl --enable-libevent were given. > Error message: > {noformat} > ... > ... > ../../../3rdparty/libprocess/src/process.cpp: In member function ‘void > process::SocketManager::link_connect(const process::Future&, > process::network::Socket, const process::UPID&)’: > ../../../3rdparty/libprocess/src/process.cpp:1457:25: error: ‘url’ was not > declared in this scope >Try ip = url.ip; > ^ > Makefile:997: recipe for target 'libprocess_la-process.lo' failed > make[5]: *** [libprocess_la-process.lo] Error 1 > ... > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6237) Agent Sandbox inaccessible when using IPv6 address in patch from https://github.com/lava/mesos/tree/bennoe/ipv6
[ https://issues.apache.org/jira/browse/MESOS-6237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15569389#comment-15569389 ] Benno Evers commented on MESOS-6237:
Hm, one place that definitely needs to be fixed is in master/http/http.cpp: Try<string> hostname = info.has_hostname() ? info.hostname() : net::getHostname(net::IP(ntohl(info.ip()))); However, this shouldn't affect the agent display if I understand the code correctly. Can I ask how you are getting a raw IP displayed in the mesos UI anyways? I found it hard to start an agent for testing purposes without mesos figuring out the hostname automatically.
> Agent Sandbox inaccessible when using IPv6 address in patch from > https://github.com/lava/mesos/tree/bennoe/ipv6 > --- > > Key: MESOS-6237 > URL: https://issues.apache.org/jira/browse/MESOS-6237 > Project: Mesos > Issue Type: Bug >Reporter: Lukas Loesche >Assignee: Benno Evers > > Affects https://github.com/lava/mesos/tree/bennoe/ipv6 at commit > 2199a24c0b7a782a0381aad8cceacbc95ec3d5c9 > When using IPs instead of hostnames the Agent Sandbox is inaccessible in the > Web UI. The problem seems to be that there's no brackets around the IP so it > tries to access e.g. http://2001:41d0:1000:ab9:::5051 instead of > http://[2001:41d0:1000:ab9::]:5051
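The quoted report for MESOS-6237 comes down to missing brackets around an IPv6 literal in a URL authority. As a minimal sketch, assuming a helper of this shape exists wherever the sandbox URL is assembled (the name `hostPort` is hypothetical, and the Web UI itself may build the URL elsewhere), the formatting rule could look like this:

```cpp
#include <string>

// Hypothetical helper: formats the "host:port" part of a URL. A host that
// contains ':' is treated as an IPv6 literal and wrapped in brackets, as
// URL syntax requires; hostnames and IPv4 addresses are left untouched.
std::string hostPort(const std::string& host, int port)
{
  const bool ipv6Literal = host.find(':') != std::string::npos;
  const std::string formatted = ipv6Literal ? "[" + host + "]" : host;
  return formatted + ":" + std::to_string(port);
}
```

For the address in the report, this produces the bracketed form that the reporter expected.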
[jira] [Comment Edited] (MESOS-6237) Agent Sandbox inaccessible when using IPv6 address in patch from https://github.com/lava/mesos/tree/bennoe/ipv6
[ https://issues.apache.org/jira/browse/MESOS-6237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15569389#comment-15569389 ] Benno Evers edited comment on MESOS-6237 at 10/12/16 5:46 PM: --
So, one place that definitely needs to be fixed is in master/http/http.cpp: Try<string> hostname = info.has_hostname() ? info.hostname() : net::getHostname(net::IP(ntohl(info.ip()))); However, this shouldn't affect the agent display if I understand the code correctly. Can I ask how you are getting a raw IP displayed in the mesos UI anyways? I found it hard to start an agent for testing purposes without mesos figuring out the hostname automatically.
was (Author: bennoe): Hm, one place that definitely needs to be fixed is in master/http/http.cpp: Try<string> hostname = info.has_hostname() ? info.hostname() : net::getHostname(net::IP(ntohl(info.ip()))); However, this shouldn't affect the agent display if I understand the code correctly. Can I ask how you are getting a raw IP displayed in the mesos UI anyways? I found it hard to start an agent for testing purposes without mesos figuring out the hostname automatically.
> Agent Sandbox inaccessible when using IPv6 address in patch from > https://github.com/lava/mesos/tree/bennoe/ipv6 > --- > > Key: MESOS-6237 > URL: https://issues.apache.org/jira/browse/MESOS-6237 > Project: Mesos > Issue Type: Bug >Reporter: Lukas Loesche >Assignee: Benno Evers > > Affects https://github.com/lava/mesos/tree/bennoe/ipv6 at commit > 2199a24c0b7a782a0381aad8cceacbc95ec3d5c9 > When using IPs instead of hostnames the Agent Sandbox is inaccessible in the > Web UI. The problem seems to be that there's no brackets around the IP so it > tries to access e.g. http://2001:41d0:1000:ab9:::5051 instead of > http://[2001:41d0:1000:ab9::]:5051
[jira] [Created] (MESOS-4606) Add IPv6 support to net::IP and net::IPNetwork
Benno Evers created MESOS-4606: -- Summary: Add IPv6 support to net::IP and net::IPNetwork Key: MESOS-4606 URL: https://issues.apache.org/jira/browse/MESOS-4606 Project: Mesos Issue Type: Improvement Components: stout Reporter: Benno Evers Assignee: Benno Evers Priority: Minor
The classes net::IP and net::IPNetwork should be able to store IPv6 addresses.
[jira] [Commented] (MESOS-4606) Add IPv6 support to net::IP and net::IPNetwork
[ https://issues.apache.org/jira/browse/MESOS-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15470145#comment-15470145 ] Benno Evers commented on MESOS-4606: Yes, an implementation is available at https://github.com/lava/mesos/commit/8b83489a5cd5e3fe81c98cae3dfe58a7e945376f There were no shepherds willing to take on this task, maybe this will change after a design document for the bigger issue (IPv6 support in mesos) is finished, which should be ready in the next few days to weeks. > Add IPv6 support to net::IP and net::IPNetwork > -- > > Key: MESOS-4606 > URL: https://issues.apache.org/jira/browse/MESOS-4606 > Project: Mesos > Issue Type: Improvement > Components: stout >Reporter: Benno Evers >Assignee: Benno Evers >Priority: Minor > Labels: network, stout > > The classes net::IP and net::IPNetwork should to be able to store IPv6 > addresses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-6237) Agent Sandbox inaccessible when using IPv6 address in patch from https://github.com/lava/mesos/tree/bennoe/ipv6
[ https://issues.apache.org/jira/browse/MESOS-6237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers reassigned MESOS-6237: -- Assignee: Benno Evers > Agent Sandbox inaccessible when using IPv6 address in patch from > https://github.com/lava/mesos/tree/bennoe/ipv6 > --- > > Key: MESOS-6237 > URL: https://issues.apache.org/jira/browse/MESOS-6237 > Project: Mesos > Issue Type: Bug >Reporter: Lukas Loesche >Assignee: Benno Evers > > Affects https://github.com/lava/mesos/tree/bennoe/ipv6 at commit > 2199a24c0b7a782a0381aad8cceacbc95ec3d5c9 > When using IPs instead of hostnames the Agent Sandbox is inaccessible in the > Web UI. The problem seems to be that there's no brackets around the IP so it > tries to access e.g. http://2001:41d0:1000:ab9:::5051 instead of > http://[2001:41d0:1000:ab9::]:5051 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-6238) SSL / libevent support broken in IPv6 patch from https://github.com/lava/mesos/tree/bennoe/ipv6
[ https://issues.apache.org/jira/browse/MESOS-6238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers reassigned MESOS-6238: -- Assignee: Benno Evers > SSL / libevent support broken in IPv6 patch from > https://github.com/lava/mesos/tree/bennoe/ipv6 > --- > > Key: MESOS-6238 > URL: https://issues.apache.org/jira/browse/MESOS-6238 > Project: Mesos > Issue Type: Bug >Reporter: Lukas Loesche >Assignee: Benno Evers > > Affects https://github.com/lava/mesos/tree/bennoe/ipv6 at commit > 2199a24c0b7a782a0381aad8cceacbc95ec3d5c9 > make fails when configure options --enable-ssl --enable-libevent were given. > Error message: > {noformat} > ... > ... > ../../../3rdparty/libprocess/src/process.cpp: In member function ‘void > process::SocketManager::link_connect(const process::Future&, > process::network::Socket, const process::UPID&)’: > ../../../3rdparty/libprocess/src/process.cpp:1457:25: error: ‘url’ was not > declared in this scope >Try ip = url.ip; > ^ > Makefile:997: recipe for target 'libprocess_la-process.lo' failed > make[5]: *** [libprocess_la-process.lo] Error 1 > ... > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-243) driver stop() should block until outstanding requests have been persisted
[ https://issues.apache.org/jira/browse/MESOS-243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers reassigned MESOS-243: - Assignee: Benno Evers > driver stop() should block until outstanding requests have been persisted > - > > Key: MESOS-243 > URL: https://issues.apache.org/jira/browse/MESOS-243 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0, 0.14.0, 0.14.1, > 0.14.2, 0.15.0 >Reporter: brian wickman >Assignee: Benno Evers > > in our executor, we send a terminal status update message and immediately > call driver.stop(). it turns out that the status update is dispatched > asynchronously and races with driver shutdown, causing tasks to instead > periodically go into LOST state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-243) driver stop() should block until outstanding requests have been persisted
[ https://issues.apache.org/jira/browse/MESOS-243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers updated MESOS-243: -- Assignee: Vladimir Petrovic (was: Benno Evers) > driver stop() should block until outstanding requests have been persisted > - > > Key: MESOS-243 > URL: https://issues.apache.org/jira/browse/MESOS-243 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0, 0.14.0, 0.14.1, > 0.14.2, 0.15.0 >Reporter: brian wickman >Assignee: Vladimir Petrovic > > in our executor, we send a terminal status update message and immediately > call driver.stop(). it turns out that the status update is dispatched > asynchronously and races with driver shutdown, causing tasks to instead > periodically go into LOST state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-8450) SlaveInfo comparison is unnecessarily expensive
Benno Evers created MESOS-8450: -- Summary: SlaveInfo comparison is unnecessarily expensive Key: MESOS-8450 URL: https://issues.apache.org/jira/browse/MESOS-8450 Project: Mesos Issue Type: Bug Reporter: Benno Evers Currently, the comparison operator of `struct SlaveInfo` is creating two temporary `Resources` objects and two temporary `Attributes` objects. All of these constructors do a bunch of work and allocate memory. Instead of passing around `SlaveInfo` in the master, we should probably use some wrapper that stores the raw message as well as caching the lazily generated `Resources` and `Attributes` objects associated with that `SlaveInfo`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
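A minimal sketch of the wrapper idea from MESOS-8450, assuming simplified stand-in types: `RawSlaveInfo`, `ParsedResources` and `CachedSlaveInfo` are hypothetical, and the real `SlaveInfo`, `Resources` and `Attributes` classes are considerably richer. The point is only that the expensive derived object is built once, on first use, and reused by later comparisons.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the raw protobuf message.
struct RawSlaveInfo
{
  std::string hostname;
  std::vector<std::string> resources;
};

// Hypothetical stand-in for the expensive-to-construct `Resources` object.
struct ParsedResources
{
  std::vector<std::string> items;
};

// Wrapper that stores the raw message and caches the lazily built
// `ParsedResources`, so comparisons do not rebuild it every time.
class CachedSlaveInfo
{
public:
  explicit CachedSlaveInfo(RawSlaveInfo raw) : raw_(std::move(raw)) {}

  const RawSlaveInfo& raw() const { return raw_; }

  const ParsedResources& resources() const
  {
    if (!parsed_) {
      resources_ = ParsedResources{raw_.resources};
      parsed_ = true;
      ++parseCount_;
    }
    return resources_;
  }

  // Exposed only so the sketch can show that parsing happens once.
  int parseCount() const { return parseCount_; }

private:
  RawSlaveInfo raw_;
  mutable bool parsed_ = false;
  mutable ParsedResources resources_;
  mutable int parseCount_ = 0;
};

bool operator==(const CachedSlaveInfo& left, const CachedSlaveInfo& right)
{
  return left.raw().hostname == right.raw().hostname &&
         left.resources().items == right.resources().items;
}
```

The `mutable` cache is not thread-safe as written; that is acceptable for a sketch, but a real wrapper would need to consider how the master accesses these objects.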
[jira] [Created] (MESOS-8451) Unhandled Interference between registration and reregistration code paths
Benno Evers created MESOS-8451: -- Summary: Unhandled Interference between registration and reregistration code paths Key: MESOS-8451 URL: https://issues.apache.org/jira/browse/MESOS-8451 Project: Mesos Issue Type: Bug Reporter: Benno Evers
Right now, the code paths for agent registration and agent re-registration run independently of each other, probably on the assumption that re-registration requires an agent ID from the master which is only given out after successful registration, so the code paths cannot interfere. However, it is not so hard to construct examples where this fails, e.g.
- Agent sends out registration message 1
- Timeout expires, agent sends out registration message 2
- Agent gets registration message 1, updates agent id, is restarted
- Agent sends reregistration message 1 after restart
Most likely, a proper solution will require introducing some kind of counter or uuid to the (re-)registration messages, which is also required for proper handling of multiple reregistration messages as described in MESOS-8273.
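The counter idea from MESOS-8451 can be sketched as follows, assuming the agent stamps each attempt with a monotonically increasing number and the master echoes it back. The `attempt` field and all type names here are hypothetical; no such field exists in the messages discussed above.

```cpp
#include <cstdint>

// Hypothetical message shapes carrying an attempt counter.
struct RegistrationAttempt
{
  std::uint64_t attempt;
};

struct RegistrationResponse
{
  std::uint64_t attempt; // Echoed back by the master.
};

// Agent-side bookkeeping: only the response that matches the most recent
// attempt is acted upon; responses to superseded attempts are ignored.
class AgentRegistration
{
public:
  RegistrationAttempt nextAttempt()
  {
    return RegistrationAttempt{++latest_};
  }

  bool accept(const RegistrationResponse& response) const
  {
    return response.attempt == latest_;
  }

private:
  std::uint64_t latest_ = 0;
};
```

In the scenario listed above, the late response to registration message 1 would carry a stale counter and could be told apart from the response to message 2. A counter that must survive an agent restart would have to be persisted or replaced by a uuid, which this sketch does not cover.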
[jira] [Created] (MESOS-8452) Prevent zero-length timeout for exponential backoff
Benno Evers created MESOS-8452: -- Summary: Prevent zero-length timeout for exponential backoff Key: MESOS-8452 URL: https://issues.apache.org/jira/browse/MESOS-8452 Project: Mesos Issue Type: Bug Reporter: Benno Evers
The current implementation of exponential backoff for registration attempts in the agent seems to have a high probability of generating zero-length timeouts, producing registration attempts that the master has no chance of responding to in time. Most likely, a minimum time between attempts should be introduced.
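A minimal sketch of randomized exponential backoff with a lower bound, as suggested in MESOS-8452. The function `backoffDelay` and its parameters are hypothetical and do not mirror the agent's actual implementation; the sketch assumes `base`, `minDelay` and `maxDelay` are non-negative and small enough that the shift cannot overflow.

```cpp
#include <algorithm>
#include <chrono>
#include <random>

using Millis = std::chrono::milliseconds;

// Hypothetical helper: picks a delay uniformly from
// [minDelay, max(minDelay, min(base * 2^attempt, maxDelay))],
// so the result can never be shorter than `minDelay`.
template <typename Rng>
Millis backoffDelay(
    Rng& rng,
    int attempt,
    Millis base,
    Millis minDelay,
    Millis maxDelay)
{
  // Clamp the exponent so the shift below stays well defined.
  const int shift = std::max(0, std::min(attempt, 20));
  const long long grown = static_cast<long long>(base.count()) << shift;

  // Upper bound of the window, never below the minimum delay.
  const long long upper = std::max<long long>(
      std::min<long long>(grown, maxDelay.count()), minDelay.count());

  std::uniform_int_distribution<long long> dist(minDelay.count(), upper);
  return Millis(dist(rng));
}
```

Sampling from a window that starts at `minDelay` instead of zero is what rules out the zero-length timeouts described above.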
[jira] [Created] (MESOS-8482) Signed/Unsigned comparisons in tests
Benno Evers created MESOS-8482: -- Summary: Signed/Unsigned comparisons in tests Key: MESOS-8482 URL: https://issues.apache.org/jira/browse/MESOS-8482 Project: Mesos Issue Type: Bug Reporter: Benno Evers
Many tests in mesos currently have comparisons between signed and unsigned integers, e.g.
{noformat}
ASSERT_EQ(4, v1Response->read_file().size());
{noformat}
or comparisons between values of different enums, e.g. TaskState and v1::TaskState:
{noformat}
ASSERT_EQ(TASK_STARTING, startingUpdate->status().state());
{noformat}
Usually, the compiler would catch these and emit a warning, but these are currently silenced because gtest headers are included using the `-isystem` command line flag.
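For the first kind of comparison in MESOS-8482, a common remedy is to give the expected value an unsigned type, which in a gtest assertion would read `ASSERT_EQ(4u, ...)`. The following standalone sketch shows the same comparison without gtest; `readFileSize` and `hasExpectedSize` are hypothetical stand-ins for the test code quoted above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for `v1Response->read_file().size()`.
std::size_t readFileSize(const std::string& data)
{
  return data.size();
}

// The literal `4u` is unsigned, so both operands of the comparison are
// unsigned and a sign-compare warning has nothing to report.
bool hasExpectedSize(const std::string& data)
{
  return 4u == readFileSize(data);
}
```

The enum case is different: there the two operands belong to unrelated enum types, so the fix is to compare against the enumerator of the matching type, not to change a literal suffix.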
[jira] [Assigned] (MESOS-8485) MasterTest.RegistryGcByCount is flaky
[ https://issues.apache.org/jira/browse/MESOS-8485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers reassigned MESOS-8485: -- Assignee: Benno Evers > MasterTest.RegistryGcByCount is flaky > - > > Key: MESOS-8485 > URL: https://issues.apache.org/jira/browse/MESOS-8485 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.5.0 >Reporter: Vinod Kone >Assignee: Benno Evers >Priority: Major > Labels: flaky-test > > Observed this while testing Mesos 1.5.0-rc1 in ASF CI. > > {code} > 3: [ RUN ] MasterTest.RegistryGcByCount > ..snip... > 3: I0123 19:22:05.929347 15994 slave.cpp:1201] Detecting new master > 3: I0123 19:22:05.931701 15988 slave.cpp:1228] Authenticating with master > master@172.17.0.2:45634 > 3: I0123 19:22:05.931838 15988 slave.cpp:1237] Using default CRAM-MD5 > authenticatee > 3: I0123 19:22:05.932153 15999 authenticatee.cpp:121] Creating new client > SASL connection > 3: I0123 19:22:05.932580 15992 master.cpp:8958] Authenticating > slave(442)@172.17.0.2:45634 > 3: I0123 19:22:05.932822 15990 authenticator.cpp:414] Starting authentication > session for crammd5-authenticatee(870)@172.17.0.2:45634 > 3: I0123 19:22:05.933163 15989 authenticator.cpp:98] Creating new server SASL > connection > 3: I0123 19:22:05.933465 16001 authenticatee.cpp:213] Received SASL > authentication mechanisms: CRAM-MD5 > 3: I0123 19:22:05.933495 16001 authenticatee.cpp:239] Attempting to > authenticate with mechanism 'CRAM-MD5' > 3: I0123 19:22:05.933631 15987 authenticator.cpp:204] Received SASL > authentication start > 3: I0123 19:22:05.933712 15987 authenticator.cpp:326] Authentication requires > more steps > 3: I0123 19:22:05.933851 15987 authenticatee.cpp:259] Received SASL > authentication step > 3: I0123 19:22:05.934006 15987 authenticator.cpp:232] Received SASL > authentication step > 3: I0123 19:22:05.934041 15987 auxprop.cpp:109] Request to lookup properties > for user: 'test-principal' realm: '455912973e2c' server FQDN: 
'455912973e2c' > SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false > SASL_AUXPROP_AUTHZID: false > 3: I0123 19:22:05.934095 15987 auxprop.cpp:181] Looking up auxiliary property > '*userPassword' > 3: I0123 19:22:05.934147 15987 auxprop.cpp:181] Looking up auxiliary property > '*cmusaslsecretCRAM-MD5' > 3: I0123 19:22:05.934279 15987 auxprop.cpp:109] Request to lookup properties > for user: 'test-principal' realm: '455912973e2c' server FQDN: '455912973e2c' > SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false > SASL_AUXPROP_AUTHZID: true > 3: I0123 19:22:05.934298 15987 auxprop.cpp:131] Skipping auxiliary property > '*userPassword' since SASL_AUXPROP_AUTHZID == true > 3: I0123 19:22:05.934307 15987 auxprop.cpp:131] Skipping auxiliary property > '*cmusaslsecretCRAM-MD5' since SASL_AUXPROP_AUTHZID == true > 3: I0123 19:22:05.934324 15987 authenticator.cpp:318] Authentication success > 3: I0123 19:22:05.934463 15995 authenticatee.cpp:299] Authentication success > 3: I0123 19:22:05.934563 16002 master.cpp:8988] Successfully authenticated > principal 'test-principal' at slave(442)@172.17.0.2:45634 > 3: I0123 19:22:05.934708 15993 authenticator.cpp:432] Authentication session > cleanup for crammd5-authenticatee(870)@172.17.0.2:45634 > 3: I0123 19:22:05.934891 15995 slave.cpp:1320] Successfully authenticated > with master master@172.17.0.2:45634 > 3: I0123 19:22:05.935261 15995 slave.cpp:1764] Will retry registration in > 2.234083ms if necessary > 3: I0123 19:22:05.935436 15999 master.cpp:6061] Received register agent > message from slave(442)@172.17.0.2:45634 (455912973e2c) > 3: I0123 19:22:05.935662 15999 master.cpp:3867] Authorizing agent with > principal 'test-principal' > 3: I0123 19:22:05.936161 15992 master.cpp:6123] Authorized registration of > agent at slave(442)@172.17.0.2:45634 (455912973e2c) > 3: I0123 19:22:05.936261 15992 master.cpp:6234] Registering agent at > slave(442)@172.17.0.2:45634 (455912973e2c) with id > 
eef8ea11-9247-44f3-84cf-340b24df3a52-S0 > 3: I0123 19:22:05.936993 15989 registrar.cpp:495] Applied 1 operations in > 227911ns; attempting to update the registry > 3: I0123 19:22:05.937814 15989 registrar.cpp:552] Successfully updated the > registry in 743168ns > 3: I0123 19:22:05.938057 15991 master.cpp:6282] Admitted agent > eef8ea11-9247-44f3-84cf-340b24df3a52-S0 at slave(442)@172.17.0.2:45634 > (455912973e2c) > 3: I0123 19:22:05.938891 15991 master.cpp:6331] Registered agent > eef8ea11-9247-44f3-84cf-340b24df3a52-S0 at slave(442)@172.17.0.2:45634 > (455912973e2c) with cpus:2; mem:1024; disk:1024; ports:[31000-32000] > 3: I0123 19:22:05.939159 16002 slave.cpp:1764] Will retry registration in > 26.332876ms if necessary >
[jira] [Commented] (MESOS-8484) stout test NumifyTest.HexNumberTest fails.
[ https://issues.apache.org/jira/browse/MESOS-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338220#comment-16338220 ] Benno Evers commented on MESOS-8484: In Boost 1.53, lexical_cast implements its own parser that doesn't handle the '0x' prefix, so parsing the two strings in the test would return an error. In Boost 1.65, lexical_cast calls std::istream::operator>>, which on macOS (i.e. using libc++) can successfully parse strings of the form "0x10.9" or "0x1p-5", and returns the correct number. On Linux platforms (i.e. using libstdc++), std::istream::operator>> is not able to parse these strings and thus returns an error. The function stout::numify wants to achieve platform independence by forbidding these kinds of literals on all platforms. However, the checks only happen *after* Boost has already been given the chance to parse the string, and that parsing has platform-dependent behaviour. > stout test NumifyTest.HexNumberTest fails. > --- > > Key: MESOS-8484 > URL: https://issues.apache.org/jira/browse/MESOS-8484 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.6.0 > Environment: macOS 10.13.2 (17C88) > Apple LLVM version 9.0.0 (clang-900.0.37) > ../configure && make check -j6 >Reporter: Till Toenshoff >Assignee: Benjamin Bannier >Priority: Blocker > > The current Mesos master shows the following on my machine: > {noformat} > [ RUN ] NumifyTest.HexNumberTest > ../../../3rdparty/stout/tests/numify_tests.cpp:57: Failure > Value of: numify("0x10.9").isError() > Actual: false > Expected: true > ../../../3rdparty/stout/tests/numify_tests.cpp:58: Failure > Value of: numify("0x1p-5").isError() > Actual: false > Expected: true > [ FAILED ] NumifyTest.HexNumberTest (0 ms) > {noformat} > This problem disappears for me when reverting the latest boost upgrade. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
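To illustrate the ordering problem described in the MESOS-8484 comment above, here is a minimal sketch that rejects hex-prefixed input before any stream parsing takes place, so the result no longer depends on the standard library in use. It assumes C++17 and uses only the standard library; `hasHexPrefix` and `parseDecimal` are hypothetical names made up for this illustration and are not the stout or Boost code. Unlike stout::numify, the sketch refuses every `0x` literal rather than only the hexadecimal floating-point ones.

```cpp
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper (not part of stout): returns true if `s` starts with
// "0x" or "0X", optionally preceded by a sign.
bool hasHexPrefix(const std::string& s)
{
  const std::size_t i =
    (!s.empty() && (s[0] == '+' || s[0] == '-')) ? 1 : 0;

  return s.size() >= i + 2 &&
         s[i] == '0' &&
         (s[i + 1] == 'x' || s[i + 1] == 'X');
}

// Hypothetical stand-in for a numify-like function for doubles.
std::optional<double> parseDecimal(const std::string& s)
{
  // Reject hex literals *before* the stream gets a chance to parse them,
  // so libc++ and libstdc++ never see the input they disagree on.
  if (hasHexPrefix(s)) {
    return std::nullopt;
  }

  std::istringstream in(s);
  double value = 0.0;

  // The extraction must succeed and consume the whole string.
  if (!(in >> value) || in.peek() != std::char_traits<char>::eof()) {
    return std::nullopt;
  }

  return value;
}
```

With this ordering, `parseDecimal("0x10.9")` and `parseDecimal("0x1p-5")` are rejected by the prefix check on every platform, while plain decimal input such as `"10.5"` still goes through the stream.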
[jira] [Commented] (MESOS-7699) "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable freshly released)
[ https://issues.apache.org/jira/browse/MESOS-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339357#comment-16339357 ] Benno Evers commented on MESOS-7699: Updated review chain after fixing a bunch of other issues blocking this: https://reviews.apache.org/r/62447/ > "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable > freshly released) > --- > > Key: MESOS-7699 > URL: https://issues.apache.org/jira/browse/MESOS-7699 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.2.0 >Reporter: Adam Cecile >Assignee: Benno Evers >Priority: Major > Labels: autotools > > Hi, > It seems the issue comes from a workaround added a while ago: > https://reviews.apache.org/r/40326/ > https://reviews.apache.org/r/40327/ > When building with external libraries it turns out creating build commands > line with -isystem /usr/include which is clearly stated as being wrong, > according to GCC guys: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70129 > I'll do some testing by reverting all -isystem to -I and I'll let it know if > it gets built. > Regards, Adam. > {noformat} > configure:21642: result: no > configure:21642: checking glog/logging.h presence > configure:21642: g++ -E -I/usr/include -I/usr/include/apr-1 > -I/usr/include/apr-1.0 -Wdate-time -D_FORTIFY_SOURCE=2 -isystem /usr/include > -I/usr/include conftest.cpp > In file included from /usr/include/c++/6/ext/string_conversions.h:41:0, > from /usr/include/c++/6/bits/basic_string.h:5417, > from /usr/include/c++/6/string:52, > from /usr/include/c++/6/bits/locale_classes.h:40, > from /usr/include/c++/6/bits/ios_base.h:41, > from /usr/include/c++/6/ios:42, > from /usr/include/c++/6/ostream:38, > from /usr/include/glog/logging.h:43, > from conftest.cpp:32: > /usr/include/c++/6/cstdlib:75:25: fatal error: stdlib.h: No such file or > directory > #include_next > ^ > compilation terminated. > configure:21642: $? 
= 1 > configure: failed program was: > | /* confdefs.h */ > | #define PACKAGE_NAME "mesos" > | #define PACKAGE_TARNAME "mesos" > | #define PACKAGE_VERSION "1.2.0" > | #define PACKAGE_STRING "mesos 1.2.0" > | #define PACKAGE_BUGREPORT "" > | #define PACKAGE_URL "" > | #define PACKAGE "mesos" > | #define VERSION "1.2.0" > | #define STDC_HEADERS 1 > | #define HAVE_SYS_TYPES_H 1 > | #define HAVE_SYS_STAT_H 1 > | #define HAVE_STDLIB_H 1 > | #define HAVE_STRING_H 1 > | #define HAVE_MEMORY_H 1 > | #define HAVE_STRINGS_H 1 > | #define HAVE_INTTYPES_H 1 > | #define HAVE_STDINT_H 1 > | #define HAVE_UNISTD_H 1 > | #define HAVE_DLFCN_H 1 > | #define LT_OBJDIR ".libs/" > | #define HAVE_CXX11 1 > | #define HAVE_PTHREAD_PRIO_INHERIT 1 > | #define HAVE_PTHREAD 1 > | #define HAVE_LIBZ 1 > | #define HAVE_FTS_H 1 > | #define HAVE_APR_POOLS_H 1 > | #define HAVE_LIBAPR_1 1 > | #define HAVE_BOOST_VERSION_HPP 1 > | #define HAVE_LIBCURL 1 > | /* end confdefs.h. */ > | #include > configure:21642: result: no > configure:21642: checking for glog/logging.h > configure:21642: result: no > configure:21674: error: cannot find glog > --- > You have requested the use of a non-bundled glog but no suitable > glog could be found. > You may want specify the location of glog by providing a prefix > path via --with-glog=DIR, or check that the path you provided is > correct if you're already doing this. > --- > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-7699) "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable freshly released)
[ https://issues.apache.org/jira/browse/MESOS-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339357#comment-16339357 ] Benno Evers edited comment on MESOS-7699 at 1/25/18 3:25 PM: - Updated review chain after fixing a bunch of other issues blocking this: [https://reviews.apache.org/r/62447/] [https://reviews.apache.org/r/65289/] [https://reviews.apache.org/r/65290/] was (Author: bennoe): Updated review chain after fixing a bunch of other issues blocking this: https://reviews.apache.org/r/62447/ > "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable > freshly released) > --- > > Key: MESOS-7699 > URL: https://issues.apache.org/jira/browse/MESOS-7699 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.2.0 >Reporter: Adam Cecile >Assignee: Benno Evers >Priority: Major > Labels: autotools > > Hi, > It seems the issue comes from a workaround added a while ago: > https://reviews.apache.org/r/40326/ > https://reviews.apache.org/r/40327/ > When building with external libraries it turns out creating build commands > line with -isystem /usr/include which is clearly stated as being wrong, > according to GCC guys: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70129 > I'll do some testing by reverting all -isystem to -I and I'll let it know if > it gets built. > Regards, Adam. 
> {noformat} > configure:21642: result: no > configure:21642: checking glog/logging.h presence > configure:21642: g++ -E -I/usr/include -I/usr/include/apr-1 > -I/usr/include/apr-1.0 -Wdate-time -D_FORTIFY_SOURCE=2 -isystem /usr/include > -I/usr/include conftest.cpp > In file included from /usr/include/c++/6/ext/string_conversions.h:41:0, > from /usr/include/c++/6/bits/basic_string.h:5417, > from /usr/include/c++/6/string:52, > from /usr/include/c++/6/bits/locale_classes.h:40, > from /usr/include/c++/6/bits/ios_base.h:41, > from /usr/include/c++/6/ios:42, > from /usr/include/c++/6/ostream:38, > from /usr/include/glog/logging.h:43, > from conftest.cpp:32: > /usr/include/c++/6/cstdlib:75:25: fatal error: stdlib.h: No such file or > directory > #include_next > ^ > compilation terminated. > configure:21642: $? = 1 > configure: failed program was: > | /* confdefs.h */ > | #define PACKAGE_NAME "mesos" > | #define PACKAGE_TARNAME "mesos" > | #define PACKAGE_VERSION "1.2.0" > | #define PACKAGE_STRING "mesos 1.2.0" > | #define PACKAGE_BUGREPORT "" > | #define PACKAGE_URL "" > | #define PACKAGE "mesos" > | #define VERSION "1.2.0" > | #define STDC_HEADERS 1 > | #define HAVE_SYS_TYPES_H 1 > | #define HAVE_SYS_STAT_H 1 > | #define HAVE_STDLIB_H 1 > | #define HAVE_STRING_H 1 > | #define HAVE_MEMORY_H 1 > | #define HAVE_STRINGS_H 1 > | #define HAVE_INTTYPES_H 1 > | #define HAVE_STDINT_H 1 > | #define HAVE_UNISTD_H 1 > | #define HAVE_DLFCN_H 1 > | #define LT_OBJDIR ".libs/" > | #define HAVE_CXX11 1 > | #define HAVE_PTHREAD_PRIO_INHERIT 1 > | #define HAVE_PTHREAD 1 > | #define HAVE_LIBZ 1 > | #define HAVE_FTS_H 1 > | #define HAVE_APR_POOLS_H 1 > | #define HAVE_LIBAPR_1 1 > | #define HAVE_BOOST_VERSION_HPP 1 > | #define HAVE_LIBCURL 1 > | /* end confdefs.h. 
*/ > | #include > configure:21642: result: no > configure:21642: checking for glog/logging.h > configure:21642: result: no > configure:21674: error: cannot find glog > --- > You have requested the use of a non-bundled glog but no suitable > glog could be found. > You may want specify the location of glog by providing a prefix > path via --with-glog=DIR, or check that the path you provided is > correct if you're already doing this. > --- > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8485) MasterTest.RegistryGcByCount is flaky
[ https://issues.apache.org/jira/browse/MESOS-8485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341216#comment-16341216 ] Benno Evers commented on MESOS-8485:

This is fairly reproducible when putting the test machine under heavy load (i.e. ca. 1 failure per 3000 runs when I'm compiling Mesos with 24 threads at the same time).

What happens is the following: The test case starts two different instances of `mesos-agent`, marks both of them as gone, and forces one of them to be garbage collected. It expects that after this is done, one of the slaves will be marked as "gone" and the other will be unknown.

To get the agent id of the agents it registers, the following code is used:

{noformat}
Future<SlaveRegisteredMessage> slaveRegisteredMessage =
  FUTURE_PROTOBUF(SlaveRegisteredMessage(), master.get()->pid, _);

Try<Owned<cluster::Slave>> slave = StartSlave(detector.get(), slaveFlags);
AWAIT_READY(slaveRegisteredMessage);

[...] (the slave is marked as gone here)

Future<SlaveRegisteredMessage> slaveRegisteredMessage2 =
  FUTURE_PROTOBUF(SlaveRegisteredMessage(), master.get()->pid, _);

Try<Owned<cluster::Slave>> slave2 = StartSlave(detector.get(), slaveFlags2);
AWAIT_READY(slaveRegisteredMessage2);
{noformat}

In the failure case, the registration of the first agent works as follows:

{noformat}
agent0: Sends RegisterSlaveMessage
master: Does registration, adds SlaveRegisteredMessage to outbound message queue
agent0: Didn't get an answer after timeout, resends RegisterSlaveMessage
agent0: Gets the previously sent SlaveRegisteredMessage
master: Gets the second RegisterSlaveMessage, notices that agent0 is already registered and just resends the Slave
test: Proceeds to mark agent0 as gone, creates the Future for agent1
test: The future is satisfied by the second SlaveRegisteredMessage sent by the master
{noformat}

This leads the test code to think that agent1 has the agent id of agent0, which causes the subsequent test failure.
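The interleaving above can be mimicked with a small standalone model. This is only a sketch under stated assumptions: `Message`, `firstMatch` and `failingRun` are hypothetical names invented for the illustration, the pids and agent ids are made up, and none of it is libprocess or the Mesos test harness. It models an expectation that matches on message type alone (comparable to leaving the recipient as a wildcard) next to one that also matches on the recipient.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Simplified stand-in for a message sent by the master; not a libprocess type.
struct Message
{
  std::string type;     // e.g. "SlaveRegisteredMessage"
  std::string to;       // pid of the recipient (made-up values below)
  std::string agentId;  // agent id carried in the payload
};

// Returns the first message at index `from` or later that has the given
// type and, if `to` is set, the given recipient. An unset `to` plays the
// role of a wildcard that accepts any recipient.
std::optional<Message> firstMatch(
    const std::vector<Message>& wire,
    std::size_t from,
    const std::string& type,
    const std::optional<std::string>& to = std::nullopt)
{
  for (std::size_t i = from; i < wire.size(); ++i) {
    if (wire[i].type == type && (!to.has_value() || wire[i].to == *to)) {
      return wire[i];
    }
  }

  return std::nullopt;
}

// Messages sent by the master in the failing interleaving.
std::vector<Message> failingRun()
{
  return {
    {"SlaveRegisteredMessage", "slave(1)", "S0"},  // reply to agent0
    {"SlaveRegisteredMessage", "slave(1)", "S0"},  // resend after agent0's retry
    {"SlaveRegisteredMessage", "slave(2)", "S1"},  // reply to agent1
  };
}
```

If the second expectation is created after the first message has been consumed (index 1 onwards), `firstMatch(failingRun(), 1, "SlaveRegisteredMessage")` yields the resent message carrying `S0`, whereas also passing `std::string("slave(2)")` as the recipient yields the message carrying `S1`.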
Mesos basically works correctly here, so the correct fix seems to be to rewrite the test to wait for a `SlaveRegisteredMessage` that is actually destined for the correct pid. > MasterTest.RegistryGcByCount is flaky > - > > Key: MESOS-8485 > URL: https://issues.apache.org/jira/browse/MESOS-8485 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.5.0 >Reporter: Vinod Kone >Assignee: Benno Evers >Priority: Major > Labels: flaky-test > > Observed this while testing Mesos 1.5.0-rc1 in ASF CI. > > {code} > 3: [ RUN ] MasterTest.RegistryGcByCount > ..snip... > 3: I0123 19:22:05.929347 15994 slave.cpp:1201] Detecting new master > 3: I0123 19:22:05.931701 15988 slave.cpp:1228] Authenticating with master > master@172.17.0.2:45634 > 3: I0123 19:22:05.931838 15988 slave.cpp:1237] Using default CRAM-MD5 > authenticatee > 3: I0123 19:22:05.932153 15999 authenticatee.cpp:121] Creating new client > SASL connection > 3: I0123 19:22:05.932580 15992 master.cpp:8958] Authenticating > slave(442)@172.17.0.2:45634 > 3: I0123 19:22:05.932822 15990 authenticator.cpp:414] Starting authentication > session for crammd5-authenticatee(870)@172.17.0.2:45634 > 3: I0123 19:22:05.933163 15989 authenticator.cpp:98] Creating new server SASL > connection > 3: I0123 19:22:05.933465 16001 authenticatee.cpp:213] Received SASL > authentication mechanisms: CRAM-MD5 > 3: I0123 19:22:05.933495 16001 authenticatee.cpp:239] Attempting to > authenticate with mechanism 'CRAM-MD5' > 3: I0123 19:22:05.933631 15987 authenticator.cpp:204] Received SASL > authentication start > 3: I0123 19:22:05.933712 15987 authenticator.cpp:326] Authentication requires > more steps > 3: I0123 19:22:05.933851 15987 authenticatee.cpp:259] Received SASL > authentication step > 3: I0123 19:22:05.934006 15987 authenticator.cpp:232] Received SASL > authentication step > 3: I0123 19:22:05.934041 15987 auxprop.cpp:109] Request to lookup properties > for user: 'test-principal' realm: '455912973e2c' server FQDN: 
'455912973e2c' > SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false > SASL_AUXPROP_AUTHZID: false > 3: I0123 19:22:05.934095 15987 auxprop.cpp:181] Looking up auxiliary property > '*userPassword' > 3: I0123 19:22:05.934147 15987 auxprop.cpp:181] Looking up auxiliary property > '*cmusaslsecretCRAM-MD5' > 3: I0123 19:22:05.934279 15987 auxprop.cpp:109] Request to lookup properties > for user: 'test-principal' realm: '455912973e2c' server FQDN: '455912973e2c' > SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false > SASL_AUXPROP_AUTHZID: true > 3: I0123 19:22:05.934298 15987 auxprop.cpp:131] Skipping auxiliary property > '*userPassword' since SASL_AUXPROP_AUTHZID == true > 3: I0123 19:22:05.934307 15987 auxprop.cpp:131] Skipping auxiliary property
[jira] [Commented] (MESOS-8485) MasterTest.RegistryGcByCount is flaky
[ https://issues.apache.org/jira/browse/MESOS-8485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341377#comment-16341377 ] Benno Evers commented on MESOS-8485: Review posted at: https://reviews.apache.org/r/65354 > MasterTest.RegistryGcByCount is flaky > - > > Key: MESOS-8485 > URL: https://issues.apache.org/jira/browse/MESOS-8485 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.5.0 >Reporter: Vinod Kone >Assignee: Benno Evers >Priority: Major > Labels: flaky-test > > Observed this while testing Mesos 1.5.0-rc1 in ASF CI. > > {code} > 3: [ RUN ] MasterTest.RegistryGcByCount > ..snip... > 3: I0123 19:22:05.929347 15994 slave.cpp:1201] Detecting new master > 3: I0123 19:22:05.931701 15988 slave.cpp:1228] Authenticating with master > master@172.17.0.2:45634 > 3: I0123 19:22:05.931838 15988 slave.cpp:1237] Using default CRAM-MD5 > authenticatee > 3: I0123 19:22:05.932153 15999 authenticatee.cpp:121] Creating new client > SASL connection > 3: I0123 19:22:05.932580 15992 master.cpp:8958] Authenticating > slave(442)@172.17.0.2:45634 > 3: I0123 19:22:05.932822 15990 authenticator.cpp:414] Starting authentication > session for crammd5-authenticatee(870)@172.17.0.2:45634 > 3: I0123 19:22:05.933163 15989 authenticator.cpp:98] Creating new server SASL > connection > 3: I0123 19:22:05.933465 16001 authenticatee.cpp:213] Received SASL > authentication mechanisms: CRAM-MD5 > 3: I0123 19:22:05.933495 16001 authenticatee.cpp:239] Attempting to > authenticate with mechanism 'CRAM-MD5' > 3: I0123 19:22:05.933631 15987 authenticator.cpp:204] Received SASL > authentication start > 3: I0123 19:22:05.933712 15987 authenticator.cpp:326] Authentication requires > more steps > 3: I0123 19:22:05.933851 15987 authenticatee.cpp:259] Received SASL > authentication step > 3: I0123 19:22:05.934006 15987 authenticator.cpp:232] Received SASL > authentication step > 3: I0123 19:22:05.934041 15987 auxprop.cpp:109] Request to lookup 
properties > for user: 'test-principal' realm: '455912973e2c' server FQDN: '455912973e2c' > SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false > SASL_AUXPROP_AUTHZID: false > 3: I0123 19:22:05.934095 15987 auxprop.cpp:181] Looking up auxiliary property > '*userPassword' > 3: I0123 19:22:05.934147 15987 auxprop.cpp:181] Looking up auxiliary property > '*cmusaslsecretCRAM-MD5' > 3: I0123 19:22:05.934279 15987 auxprop.cpp:109] Request to lookup properties > for user: 'test-principal' realm: '455912973e2c' server FQDN: '455912973e2c' > SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false > SASL_AUXPROP_AUTHZID: true > 3: I0123 19:22:05.934298 15987 auxprop.cpp:131] Skipping auxiliary property > '*userPassword' since SASL_AUXPROP_AUTHZID == true > 3: I0123 19:22:05.934307 15987 auxprop.cpp:131] Skipping auxiliary property > '*cmusaslsecretCRAM-MD5' since SASL_AUXPROP_AUTHZID == true > 3: I0123 19:22:05.934324 15987 authenticator.cpp:318] Authentication success > 3: I0123 19:22:05.934463 15995 authenticatee.cpp:299] Authentication success > 3: I0123 19:22:05.934563 16002 master.cpp:8988] Successfully authenticated > principal 'test-principal' at slave(442)@172.17.0.2:45634 > 3: I0123 19:22:05.934708 15993 authenticator.cpp:432] Authentication session > cleanup for crammd5-authenticatee(870)@172.17.0.2:45634 > 3: I0123 19:22:05.934891 15995 slave.cpp:1320] Successfully authenticated > with master master@172.17.0.2:45634 > 3: I0123 19:22:05.935261 15995 slave.cpp:1764] Will retry registration in > 2.234083ms if necessary > 3: I0123 19:22:05.935436 15999 master.cpp:6061] Received register agent > message from slave(442)@172.17.0.2:45634 (455912973e2c) > 3: I0123 19:22:05.935662 15999 master.cpp:3867] Authorizing agent with > principal 'test-principal' > 3: I0123 19:22:05.936161 15992 master.cpp:6123] Authorized registration of > agent at slave(442)@172.17.0.2:45634 (455912973e2c) > 3: I0123 19:22:05.936261 15992 master.cpp:6234] Registering 
agent at > slave(442)@172.17.0.2:45634 (455912973e2c) with id > eef8ea11-9247-44f3-84cf-340b24df3a52-S0 > 3: I0123 19:22:05.936993 15989 registrar.cpp:495] Applied 1 operations in > 227911ns; attempting to update the registry > 3: I0123 19:22:05.937814 15989 registrar.cpp:552] Successfully updated the > registry in 743168ns > 3: I0123 19:22:05.938057 15991 master.cpp:6282] Admitted agent > eef8ea11-9247-44f3-84cf-340b24df3a52-S0 at slave(442)@172.17.0.2:45634 > (455912973e2c) > 3: I0123 19:22:05.938891 15991 master.cpp:6331] Registered agent > eef8ea11-9247-44f3-84cf-340b24df3a52-S0 at slave(442)@172.17.0.2:45634 > (455912973e2c) with cpus:2; mem:1024; disk:1024; ports:[31000-32000] > 3: I0123 19:22:05.939159
[jira] [Created] (MESOS-8508) Missing map header when compiling against unbundled protobuf
Benno Evers created MESOS-8508: -- Summary: Missing map header when compiling against unbundled protobuf Key: MESOS-8508 URL: https://issues.apache.org/jira/browse/MESOS-8508 Project: Mesos Issue Type: Bug Reporter: Benno Evers When compiling mesos against the system-default version of protobuf on Ubuntu 17.04, the build fails due to a missing include. Explanation for the error by [~kaysoky]: Note that the reason why this doesn't compile in protobuf 3.0.x is due to how the c++ files are generated. In protobuf 3.0.x (and 3.1.x and 3.2.x) generated code only includes the protobuf map headers if there is a map present in the .proto file: [https://github.com/google/protobuf/blob/3.0.x/src/google/protobuf/compiler/cpp/cpp_file.cc#L817-L827] From 3.3.x onwards, all generated files include {{google/protobuf/generated_message_table_driven.h}}, which in turn includes the map headers: [https://github.com/google/protobuf/blob/3.3.x/src/google/protobuf/compiler/cpp/cpp_file.cc#L1006] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-3915) Upgrade vendored Boost
[ https://issues.apache.org/jira/browse/MESOS-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers updated MESOS-3915: --- Sprint: Mesosphere Sprint 73 Story Points: 5 > Upgrade vendored Boost > -- > > Key: MESOS-3915 > URL: https://issues.apache.org/jira/browse/MESOS-3915 > Project: Mesos > Issue Type: Bug >Reporter: Neil Conway >Assignee: Benno Evers >Priority: Minor > Labels: boost, mesosphere, tech-debt > Fix For: 1.6.0 > > > We should upgrade the vendored version of Boost to a newer version. Benefits: > * -Should properly fix MESOS-688- > * -Should fix MESOS-3799- > * Generally speaking, using a more modern version of Boost means we can take > advantage of bug fixes, optimizations, and new features. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-7699) "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable freshly released)
[ https://issues.apache.org/jira/browse/MESOS-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers updated MESOS-7699: --- Sprint: Mesosphere Sprint 66, Mesosphere Sprint 67, Mesosphere Sprint 68, Mesosphere Sprint 73 (was: Mesosphere Sprint 66, Mesosphere Sprint 67, Mesosphere Sprint 68) > "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable > freshly released) > --- > > Key: MESOS-7699 > URL: https://issues.apache.org/jira/browse/MESOS-7699 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.2.0 >Reporter: Adam Cecile >Assignee: Benno Evers >Priority: Major > Labels: autotools > Fix For: 1.6.0 > > > Hi, > It seems the issue comes from a workaround added a while ago: > https://reviews.apache.org/r/40326/ > https://reviews.apache.org/r/40327/ > When building with external libraries it turns out creating build commands > line with -isystem /usr/include which is clearly stated as being wrong, > according to GCC guys: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70129 > I'll do some testing by reverting all -isystem to -I and I'll let it know if > it gets built. > Regards, Adam. > {noformat} > configure:21642: result: no > configure:21642: checking glog/logging.h presence > configure:21642: g++ -E -I/usr/include -I/usr/include/apr-1 > -I/usr/include/apr-1.0 -Wdate-time -D_FORTIFY_SOURCE=2 -isystem /usr/include > -I/usr/include conftest.cpp > In file included from /usr/include/c++/6/ext/string_conversions.h:41:0, > from /usr/include/c++/6/bits/basic_string.h:5417, > from /usr/include/c++/6/string:52, > from /usr/include/c++/6/bits/locale_classes.h:40, > from /usr/include/c++/6/bits/ios_base.h:41, > from /usr/include/c++/6/ios:42, > from /usr/include/c++/6/ostream:38, > from /usr/include/glog/logging.h:43, > from conftest.cpp:32: > /usr/include/c++/6/cstdlib:75:25: fatal error: stdlib.h: No such file or > directory > #include_next > ^ > compilation terminated. > configure:21642: $? 
= 1 > configure: failed program was: > | /* confdefs.h */ > | #define PACKAGE_NAME "mesos" > | #define PACKAGE_TARNAME "mesos" > | #define PACKAGE_VERSION "1.2.0" > | #define PACKAGE_STRING "mesos 1.2.0" > | #define PACKAGE_BUGREPORT "" > | #define PACKAGE_URL "" > | #define PACKAGE "mesos" > | #define VERSION "1.2.0" > | #define STDC_HEADERS 1 > | #define HAVE_SYS_TYPES_H 1 > | #define HAVE_SYS_STAT_H 1 > | #define HAVE_STDLIB_H 1 > | #define HAVE_STRING_H 1 > | #define HAVE_MEMORY_H 1 > | #define HAVE_STRINGS_H 1 > | #define HAVE_INTTYPES_H 1 > | #define HAVE_STDINT_H 1 > | #define HAVE_UNISTD_H 1 > | #define HAVE_DLFCN_H 1 > | #define LT_OBJDIR ".libs/" > | #define HAVE_CXX11 1 > | #define HAVE_PTHREAD_PRIO_INHERIT 1 > | #define HAVE_PTHREAD 1 > | #define HAVE_LIBZ 1 > | #define HAVE_FTS_H 1 > | #define HAVE_APR_POOLS_H 1 > | #define HAVE_LIBAPR_1 1 > | #define HAVE_BOOST_VERSION_HPP 1 > | #define HAVE_LIBCURL 1 > | /* end confdefs.h. */ > | #include > configure:21642: result: no > configure:21642: checking for glog/logging.h > configure:21642: result: no > configure:21674: error: cannot find glog > --- > You have requested the use of a non-bundled glog but no suitable > glog could be found. > You may want specify the location of glog by providing a prefix > path via --with-glog=DIR, or check that the path you provided is > correct if you're already doing this. > --- > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-8508) Missing map header when compiling against unbundled protobuf
[ https://issues.apache.org/jira/browse/MESOS-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers updated MESOS-8508: --- Sprint: Mesosphere Sprint 73 Story Points: 1 > Missing map header when compiling against unbundled protobuf > > > Key: MESOS-8508 > URL: https://issues.apache.org/jira/browse/MESOS-8508 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Assignee: Benno Evers >Priority: Major > Fix For: 1.6.0 > > > When compiling mesos against the system-default version of protobuf on Ubuntu > 17.04, the build fails due to a missing include. > > Explanation for the error by [~kaysoky]: > Note that the reason why this doesn't compile in protobuf 3.0.x is due to how > the c++ files are generated. In protobuf 3.0.x (and 3.1.x and 3.2.x) > generated code only includes the protobuf map headers if there is a map > present in the .proto > file:[https://github.com/google/protobuf/blob/3.0.x/src/google/protobuf/compiler/cpp/cpp_file.cc#L817-L827] > From 3.3.x onwards, all generated files include > {{google/protobuf/generated_message_table_driven.h}}, which in turn includes > the map > headers:[https://github.com/google/protobuf/blob/3.3.x/src/google/protobuf/compiler/cpp/cpp_file.cc#L1006] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8485) MasterTest.RegistryGcByCount is flaky
[ https://issues.apache.org/jira/browse/MESOS-8485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16352625#comment-16352625 ] Benno Evers commented on MESOS-8485: [~abudnik] I did not attempt to do that, because it requires suspending and resuming most of the involved individual processes separately, and as far as I'm aware our test tools don't provide such fine-grained control. > MasterTest.RegistryGcByCount is flaky > - > > Key: MESOS-8485 > URL: https://issues.apache.org/jira/browse/MESOS-8485 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.5.0 >Reporter: Vinod Kone >Assignee: Benno Evers >Priority: Major > Labels: flaky-test > > Observed this while testing Mesos 1.5.0-rc1 in ASF CI. > > {code} > 3: [ RUN ] MasterTest.RegistryGcByCount > ..snip... > 3: I0123 19:22:05.929347 15994 slave.cpp:1201] Detecting new master > 3: I0123 19:22:05.931701 15988 slave.cpp:1228] Authenticating with master > master@172.17.0.2:45634 > 3: I0123 19:22:05.931838 15988 slave.cpp:1237] Using default CRAM-MD5 > authenticatee > 3: I0123 19:22:05.932153 15999 authenticatee.cpp:121] Creating new client > SASL connection > 3: I0123 19:22:05.932580 15992 master.cpp:8958] Authenticating > slave(442)@172.17.0.2:45634 > 3: I0123 19:22:05.932822 15990 authenticator.cpp:414] Starting authentication > session for crammd5-authenticatee(870)@172.17.0.2:45634 > 3: I0123 19:22:05.933163 15989 authenticator.cpp:98] Creating new server SASL > connection > 3: I0123 19:22:05.933465 16001 authenticatee.cpp:213] Received SASL > authentication mechanisms: CRAM-MD5 > 3: I0123 19:22:05.933495 16001 authenticatee.cpp:239] Attempting to > authenticate with mechanism 'CRAM-MD5' > 3: I0123 19:22:05.933631 15987 authenticator.cpp:204] Received SASL > authentication start > 3: I0123 19:22:05.933712 15987 authenticator.cpp:326] Authentication requires > more steps > 3: I0123 19:22:05.933851 15987 authenticatee.cpp:259] Received SASL > authentication 
step > 3: I0123 19:22:05.934006 15987 authenticator.cpp:232] Received SASL > authentication step > 3: I0123 19:22:05.934041 15987 auxprop.cpp:109] Request to lookup properties > for user: 'test-principal' realm: '455912973e2c' server FQDN: '455912973e2c' > SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false > SASL_AUXPROP_AUTHZID: false > 3: I0123 19:22:05.934095 15987 auxprop.cpp:181] Looking up auxiliary property > '*userPassword' > 3: I0123 19:22:05.934147 15987 auxprop.cpp:181] Looking up auxiliary property > '*cmusaslsecretCRAM-MD5' > 3: I0123 19:22:05.934279 15987 auxprop.cpp:109] Request to lookup properties > for user: 'test-principal' realm: '455912973e2c' server FQDN: '455912973e2c' > SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false > SASL_AUXPROP_AUTHZID: true > 3: I0123 19:22:05.934298 15987 auxprop.cpp:131] Skipping auxiliary property > '*userPassword' since SASL_AUXPROP_AUTHZID == true > 3: I0123 19:22:05.934307 15987 auxprop.cpp:131] Skipping auxiliary property > '*cmusaslsecretCRAM-MD5' since SASL_AUXPROP_AUTHZID == true > 3: I0123 19:22:05.934324 15987 authenticator.cpp:318] Authentication success > 3: I0123 19:22:05.934463 15995 authenticatee.cpp:299] Authentication success > 3: I0123 19:22:05.934563 16002 master.cpp:8988] Successfully authenticated > principal 'test-principal' at slave(442)@172.17.0.2:45634 > 3: I0123 19:22:05.934708 15993 authenticator.cpp:432] Authentication session > cleanup for crammd5-authenticatee(870)@172.17.0.2:45634 > 3: I0123 19:22:05.934891 15995 slave.cpp:1320] Successfully authenticated > with master master@172.17.0.2:45634 > 3: I0123 19:22:05.935261 15995 slave.cpp:1764] Will retry registration in > 2.234083ms if necessary > 3: I0123 19:22:05.935436 15999 master.cpp:6061] Received register agent > message from slave(442)@172.17.0.2:45634 (455912973e2c) > 3: I0123 19:22:05.935662 15999 master.cpp:3867] Authorizing agent with > principal 'test-principal' > 3: I0123 19:22:05.936161 
15992 master.cpp:6123] Authorized registration of > agent at slave(442)@172.17.0.2:45634 (455912973e2c) > 3: I0123 19:22:05.936261 15992 master.cpp:6234] Registering agent at > slave(442)@172.17.0.2:45634 (455912973e2c) with id > eef8ea11-9247-44f3-84cf-340b24df3a52-S0 > 3: I0123 19:22:05.936993 15989 registrar.cpp:495] Applied 1 operations in > 227911ns; attempting to update the registry > 3: I0123 19:22:05.937814 15989 registrar.cpp:552] Successfully updated the > registry in 743168ns > 3: I0123 19:22:05.938057 15991 master.cpp:6282] Admitted agent > eef8ea11-9247-44f3-84cf-340b24df3a52-S0 at slave(442)@172.17.0.2:45634 > (455912973e2c) > 3: I0123 19:22:05.938891 15991 master.cpp:6331] Registered agent > ee
[jira] [Commented] (MESOS-8359) Health checks are flapping for all tasks on the slave if one task has no enough resources to run
[ https://issues.apache.org/jira/browse/MESOS-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354054#comment-16354054 ] Benno Evers commented on MESOS-8359:

I'm afraid I can't reproduce this:
- I started a task `python3 -m http.server` along with a `MESOS_HTTP` health check that is succeeding.
- I started the `csa-http` container from the JSON app definition supplied above.
- This fails almost immediately with the log output
{noformat}
I0206 15:35:04.468765 21810 exec.cpp:162] Version: 1.5.0
I0206 15:35:04.480106 21817 exec.cpp:236] Executor registered on agent c4c3a4b7-afa1-4e8d-a723-51777de3d429-S1
I0206 15:35:04.480873 21813 executor.cpp:120] Registered docker executor on 10.0.3.249
I0206 15:35:04.481027 21817 executor.cpp:160] Starting task newappmv_qagame_testapp.green_csahttp.4cb3e8b2-0b53-11e8-b4d2-d6e8ac5e6a60
Picked up JAVA_TOOL_OPTIONS: -Xmx32m
Killed
I0206 15:35:12.424013 21814 executor.cpp:552] Container exited with status 137
I0206 15:35:13.424458 21810 checker_process.cpp:247] Stopped HTTP health check for task 'newappmv_qagame_testapp.green_csahttp.4cb3e8b2-0b53-11e8-b4d2-d6e8ac5e6a60'
{noformat}
- After waiting for 15m, the health check for the python3 task is not flapping but stably succeeding.

Is it maybe possible to further reduce the test case? In particular, do you still observe the same behaviour if you remove docker and marathon from the picture by just starting the jar directly on the slave node?

> Health checks are flapping for all tasks on the slave if one task has no > enough resources to run > > > Key: MESOS-8359 > URL: https://issues.apache.org/jira/browse/MESOS-8359 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.3.2 >Reporter: Viacheslav Valyavskiy >Priority: Major > Attachments: logs2 > > > I have attached some logs from the affected > slave(newappmv_qagame_testapp.green_csahttp - name of the 'bad' application) > Steps to reproduce: > 1. Run multiple tasks on the slave > 2. 
Create marathon application from our image ( docker pull > vvalyavskiy/csa-http ) and set memory limit to 16MB for it. > 3. Wait some time and then observe flapping of all tasks on the slave where > our task is started -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8550) Bug in `Master::detected()` leads to coredump in `MasterZooKeeperTest.MasterInfoAddress`
[ https://issues.apache.org/jira/browse/MESOS-8550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357207#comment-16357207 ] Benno Evers commented on MESOS-8550: Andrei's analysis seems to be right: the code was indeed calling `leader->has_domain()` on an `Option` without first checking that it was not `None`. I posted a fix in the following review: https://reviews.apache.org/r/65571/ > Bug in `Master::detected()` leads to coredump in > `MasterZooKeeperTest.MasterInfoAddress` > > > Key: MESOS-8550 > URL: https://issues.apache.org/jira/browse/MESOS-8550 > Project: Mesos > Issue Type: Bug > Components: leader election, master >Reporter: Andrei Budnik >Priority: Major > Attachments: MasterZooKeeperTest.MasterInfoAddress-badrun.txt > > > {code:java} > 15:55:17 Assertion failed: (isSome()), function get, file > ../../3rdparty/stout/include/stout/option.hpp, line 119. > 15:55:17 *** Aborted at 1518018924 (unix time) try "date -d @1518018924" if > you are using GNU date *** > 15:55:17 PC: @ 0x7fff4f8f2e3e __pthread_kill > 15:55:17 *** SIGABRT (@0x7fff4f8f2e3e) received by PID 39896 (TID > 0x70427000) stack trace: *** > 15:55:17 @ 0x7fff4fa24f5a _sigtramp > 15:55:17 I0207 07:55:24.945252 4890624 group.cpp:511] ZooKeeper session > expired > 15:55:17 @ 0x70425500 (unknown) > 15:55:17 2018-02-07 07:55:24,945:39896(0x70633000):ZOO_INFO@log_env@794: > Client > environment:user.dir=/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/1mHCvU > 15:55:17 @ 0x7fff4f84f312 abort > 15:55:17 2018-02-07 > 07:55:24,945:39896(0x70633000):ZOO_INFO@zookeeper_init@827: Initiating > client connection, host=127.0.0.1:52197 sessionTimeout=1 > watcher=0x10d916590 sessionId=0 sessionPasswd= context=0x7fe1bda706a0 > flags=0 > 15:55:17 @ 0x7fff4f817368 __assert_rtn > 15:55:17 @0x10b9cff97 _ZNR6OptionIN5mesos10MasterInfoEE3getEv > 15:55:17 @0x10bbb04b5 Option<>::operator->() > 15:55:17 @0x10bd4514a mesos::internal::master::Master::detected() > 15:55:17 
@0x10bf54558 > _ZZN7process8dispatchIN5mesos8internal6master6MasterERKNS_6FutureI6OptionINS1_10MasterInfoSB_EEvRKNS_3PIDIT_EEMSD_FvT0_EOT1_ENKUlOS9_PNS_11ProcessBaseEE_clESM_SO_ > 15:55:17 @0x10bf54310 > _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master6MasterERKNS1_6FutureI6OptionINS3_10MasterInfoSD_EEvRKNS1_3PIDIT_EEMSF_FvT0_EOT1_EUlOSB_PNS1_11ProcessBaseEE_JSB_SQ_EEEDTclclsr3stdE7forwardISF_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSF_DpOSS_ > 15:55:17 @0x10bf542bb > _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS2_6FutureI6OptionINS4_10MasterInfoSE_EEvRKNS2_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS2_11ProcessBaseEE_JSC_NSt3__112placeholders4__phILi1E13invoke_expandISS_NST_5tupleIJSC_SW_EEENSZ_IJOSR_EEEJLm0ELm1DTclsr5cpp17E6invokeclsr3stdE7forwardISG_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardISK_Efp0_EEclsr3stdE7forwardISN_Efp2_OSG_OSK_N5cpp1416integer_sequenceImJXspT2_SO_ > 15:55:17 @0x10bf541f3 > _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS2_6FutureI6OptionINS4_10MasterInfoSE_EEvRKNS2_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS2_11ProcessBaseEE_JSC_NSt3__112placeholders4__phILi1EclIJSR_EEEDTcl13invoke_expandclL_ZNST_4moveIRSS_EEONST_16remove_referenceISG_E4typeEOSG_EdtdefpT1fEclL_ZNSZ_IRNST_5tupleIJSC_SW_ES14_S15_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_DpOS1C_ > 15:55:17 @0x10bf540bd > _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS4_6FutureI6OptionINS6_10MasterInfoSG_EEvRKNS4_3PIDIT_EEMSI_FvT0_EOT1_EUlOSE_PNS4_11ProcessBaseEE_JSE_NSt3__112placeholders4__phILi1EEJST_EEEDTclclsr3stdE7forwardISI_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSI_DpOS10_ > 15:55:17 @0x10bf54081 > 
_ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS5_6FutureI6OptionINS7_10MasterInfoSH_EEvRKNS5_3PIDIT_EEMSJ_FvT0_EOT1_EUlOSF_PNS5_11ProcessBaseEE_JSF_NSt3__112placeholders4__phILi1EEJSU_EEEvOSJ_DpOT0_ > 15:55:17 @0x10bf53e06 > _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal6master6MasterERKNS1_6FutureI6OptionINSA_10MasterInfoSK_EEvRKNS1_3PIDIT_EEMSM_FvT0_EOT1_EUlOSI_S3_E_JSI_NSt3__112placeholders4__phILi1EEEclEOS3_ > 15:55:17 @0x10ebf464f > _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_ > 15:55:17 @0x10ebf44c4 process::ProcessBase:
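The failure mode described in the comment above (dereferencing an `Option` that holds `None`) can be illustrated with a small sketch. This is not the patch from the linked review: the `Option` class below is a deliberately simplified stand-in for stout's `Option<T>`, and `MasterInfo` and `leaderHasDomain` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <string>

// Simplified stand-in for stout's `Option<T>` (assumption: the real class
// has a much richer interface; only the parts needed here are modeled).
template <typename T>
class Option {
public:
  Option() : some_(false), value_() {}  // Represents `None`.
  Option(const T& value) : some_(true), value_(value) {}

  bool isSome() const { return some_; }
  bool isNone() const { return !some_; }

  // As in the stack trace above, accessing an empty Option asserts.
  const T& get() const {
    assert(isSome());
    return value_;
  }

  const T* operator->() const { return &get(); }

private:
  bool some_;
  T value_;
};

// Hypothetical, trimmed-down replacement for the `MasterInfo` protobuf.
struct MasterInfo {
  std::string domain;
  bool has_domain() const { return !domain.empty(); }
};

// Checks for `None` before dereferencing, so a detected leader of `None`
// (e.g. after a ZooKeeper session expiry) cannot trip the assertion.
bool leaderHasDomain(const Option<MasterInfo>& leader) {
  return leader.isSome() && leader->has_domain();
}
```

Calling `leaderHasDomain(Option<MasterInfo>())` simply yields `false`, whereas an unguarded `leader->has_domain()` on the same value would hit the `isSome()` assertion.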
[jira] [Commented] (MESOS-8594) Mesos master crash (under load)
[ https://issues.apache.org/jira/browse/MESOS-8594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369238#comment-16369238 ] Benno Evers commented on MESOS-8594: The analysis by [~abudnik] seems to be correct: the actual site of the crash looks completely harmless, with no dangling pointers or anything, and the call stack is very deep, going repeatedly through `process::internal::send()` and `process::internal::_send()`. The root cause seems to be this ancient TODO in `Future::onAny()` {noformat} synchronized (data->lock) { if (data->state == PENDING) { data->onAnyCallbacks.emplace_back(std::move(callback)); } else { run = true; } } // TODO(*): Invoke callback in another execution context. if (run) { std::move(callback)(*this); // NOLINT(misc-use-after-move) }{noformat} so whenever we arrive in `send()` and the future returned by the socket is already finished, we add another 5-10 frames to the call stack. Most likely, due to the large number of big packets being sent over a loopback interface, there is always enough data to allow a large enough build-up to cause the program to run out of stack space. > Mesos master crash (under load) > --- > > Key: MESOS-8594 > URL: https://issues.apache.org/jira/browse/MESOS-8594 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.5.0, 1.6.0 >Reporter: A. Dukhovniy >Priority: Major > Attachments: lldb-bt.txt, lldb-di-f.txt, lldb-image-section.txt, > lldb-regiser-read.txt > > > Mesos master crashes under load. Attached are some infos from the `lldb`: > {code:java} > Process 41933 resuming > Process 41933 stopped > * thread #10, stop reason = EXC_BAD_ACCESS (code=2, address=0x789ecff8) > frame #0: 0x00010c30ddb6 libmesos-1.6.0.dylib`::_Some() at some.hpp:35 > 32 template <typename T> > 33 struct _Some > 34 { > -> 35 _Some(T _t) : t(std::move(_t)) {} > 36 > 37 T t; > 38 }; > Target 0: (mesos-master) stopped. 
> (lldb) > {code} > To quote [~abudnik] > {quote} > it’s the stack overflow bug in libprocess due to a way `internal::send()` and > `internal::_send()` are implemented in `process.cpp` > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
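The stack build-up described in this thread can be reproduced in miniature. The following is only a sketch under stated assumptions, not libprocess code: `sendInline`, `sendDeferred`, `RunQueue` and `Depth` are hypothetical names, and the "future" is modeled as always being complete already. It contrasts running the continuation inline (depth grows with the number of queued packets) with handing it to a run queue, which stands in for the "another execution context" mentioned in the TODO.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Tracks how many nested send calls are active at once.
struct Depth {
  std::size_t current;
  std::size_t max;
};

// Problematic pattern: the send has already completed, so the continuation
// (the next send) is invoked inline, on top of the current stack frame.
void sendInline(std::deque<int>& outgoing, Depth& depth) {
  if (outgoing.empty()) return;
  ++depth.current;
  depth.max = std::max(depth.max, depth.current);
  outgoing.pop_front();
  sendInline(outgoing, depth);  // Callback runs synchronously.
  --depth.current;
}

// Minimal run queue standing in for a separate execution context.
class RunQueue {
public:
  void defer(std::function<void()> f) { pending_.push_back(std::move(f)); }

  void run() {
    while (!pending_.empty()) {
      std::function<void()> f = std::move(pending_.front());
      pending_.pop_front();
      f();
    }
  }

private:
  std::deque<std::function<void()>> pending_;
};

// Same logic, but the continuation is queued instead of called directly,
// so each send returns before the next one starts.
void sendDeferred(std::deque<int>& outgoing, Depth& depth, RunQueue& queue) {
  if (outgoing.empty()) return;
  ++depth.current;
  depth.max = std::max(depth.max, depth.current);
  outgoing.pop_front();
  queue.defer([&outgoing, &depth, &queue] {
    sendDeferred(outgoing, depth, queue);
  });
  --depth.current;
}

std::size_t maxDepthInline(std::size_t packets) {
  std::deque<int> outgoing(packets, 0);
  Depth depth = {0, 0};
  sendInline(outgoing, depth);
  return depth.max;
}

std::size_t maxDepthDeferred(std::size_t packets) {
  std::deque<int> outgoing(packets, 0);
  Depth depth = {0, 0};
  RunQueue queue;
  sendDeferred(outgoing, depth, queue);
  queue.run();
  return depth.max;
}
```

In this model the inline variant nests once per queued packet, while the deferred variant never nests more than one level deep regardless of queue length. The trade-off is an extra queue hop per completed send.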
[jira] [Commented] (MESOS-8336) MasterTest.RegistryUpdateAfterReconfiguration is flaky
[ https://issues.apache.org/jira/browse/MESOS-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371439#comment-16371439 ] Benno Evers commented on MESOS-8336: The root cause here is a very familiar one, that has already rendered countless other tests flaky. In particular, I'm talking about this line in `slave.cpp`: {noformat} // Wait for a random amount of time before authentication or // registration. Duration duration = flags.registration_backoff_factor * ((double) os::random() / RAND_MAX);{noformat} Here, the agent is sending the re-tried `RegisterSlaveMessage` after 9ms, *just* before shutting down, and the master notices that the network link is down before it gets to processing the message. This leads to the master assigning a second slave ID, almost immediately removing the slave again because the network link is broken as well, and finally the test seeing the remnants of this second slave in the registry. > MasterTest.RegistryUpdateAfterReconfiguration is flaky > -- > > Key: MESOS-8336 > URL: https://issues.apache.org/jira/browse/MESOS-8336 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Priority: Major > Labels: flaky-test > Attachments: RegistryUpdateAfterReconfiguration-badrun.txt > > > Observed here: > https://jenkins.mesosphere.com/service/jenkins/job/mesos/job/Mesos_CI-build/2399/FLAG=CMake,label=mesos-ec2-debian-8/testReport/junit/mesos-ec2-debian-8-CMake.Mesos/MasterTest/RegistryUpdateAfterReconfiguration/ > The test here failed because the registry contained 2 slaves, when it should > have only one. > Looking through the log, everything seems normal (in particular, only 1 slave > id appears throughout this test). The only thing out of the ordinary seems to > be the agent sending two `RegisterSlaveMessage`s and two > `ReregisterSlaveMessage`s, but looking at the code for generating the random > backoff factor in the slave that seems to be more or less normal, and > shouldn't break the test. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
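To make the timing argument in the MESOS-8336 comment concrete, here is a rough sketch of the backoff expression quoted there, rewritten with the standard `<random>` and `<chrono>` facilities instead of `os::random()`. The names `registrationDelay` and `fractionBelow` are hypothetical, and the sketch draws from [0, 1) rather than [0, 1]; it is only meant to show that the delay is uniform over the whole interval, so arbitrarily short delays (such as the 9ms seen in the bad run) are always possible.

```cpp
#include <cassert>
#include <chrono>
#include <random>

typedef std::chrono::duration<double> Seconds;

// Uniformly random delay between zero and `backoffFactor`, analogous to
// `backoff_factor * (random / RAND_MAX)` in the snippet quoted above.
Seconds registrationDelay(Seconds backoffFactor, std::mt19937& rng) {
  std::uniform_real_distribution<double> unit(0.0, 1.0);
  return backoffFactor * unit(rng);
}

// Fraction of `samples` draws whose delay is shorter than `threshold`.
double fractionBelow(
    Seconds backoffFactor, Seconds threshold, int samples, unsigned seed) {
  std::mt19937 rng(seed);
  int below = 0;
  for (int i = 0; i < samples; ++i) {
    if (registrationDelay(backoffFactor, rng) < threshold) {
      ++below;
    }
  }
  return static_cast<double>(below) / samples;
}
```

Because the distribution is uniform, the probability of a delay below some threshold is simply the threshold divided by the backoff factor. A test that implicitly relies on the retry not firing within a short window will therefore fail occasionally.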
[jira] [Commented] (MESOS-8336) MasterTest.RegistryUpdateAfterReconfiguration is flaky
[ https://issues.apache.org/jira/browse/MESOS-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16373066#comment-16373066 ] Benno Evers commented on MESOS-8336: https://reviews.apache.org/r/65758/ > MasterTest.RegistryUpdateAfterReconfiguration is flaky > -- > > Key: MESOS-8336 > URL: https://issues.apache.org/jira/browse/MESOS-8336 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Priority: Major > Labels: flaky-test > Attachments: RegistryUpdateAfterReconfiguration-badrun.txt > > > Observed here: > https://jenkins.mesosphere.com/service/jenkins/job/mesos/job/Mesos_CI-build/2399/FLAG=CMake,label=mesos-ec2-debian-8/testReport/junit/mesos-ec2-debian-8-CMake.Mesos/MasterTest/RegistryUpdateAfterReconfiguration/ > The test here failed because the registry contained 2 slaves, when it should > have only one. > Looking through the log, everything seems normal (in particular, only 1 slave > id appears throughout this test). The only thing out of the ordinary seems to > be the agent sending two `RegisterSlaveMessage`s and two > `ReregisterSlaveMessage`s, but looking at the code for generating the random > backoff factor in the slave that seems to be more or less normal, and > shouldn't break the test. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8336) MasterTest.RegistryUpdateAfterReconfiguration is flaky
[ https://issues.apache.org/jira/browse/MESOS-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers reassigned MESOS-8336: -- Assignee: Benno Evers > MasterTest.RegistryUpdateAfterReconfiguration is flaky > -- > > Key: MESOS-8336 > URL: https://issues.apache.org/jira/browse/MESOS-8336 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Assignee: Benno Evers >Priority: Major > Labels: flaky-test > Attachments: RegistryUpdateAfterReconfiguration-badrun.txt > > > Observed here: > https://jenkins.mesosphere.com/service/jenkins/job/mesos/job/Mesos_CI-build/2399/FLAG=CMake,label=mesos-ec2-debian-8/testReport/junit/mesos-ec2-debian-8-CMake.Mesos/MasterTest/RegistryUpdateAfterReconfiguration/ > The test here failed because the registry contained 2 slaves, when it should > have only one. > Looking through the log, everything seems normal (in particular, only 1 slave > id appears throughout this test). The only thing out of the ordinary seems to > be the agent sending two `RegisterSlaveMessage`s and two > `ReregisterSlaveMessage`s, but looking at the code for generating the random > backoff factor in the slave that seems to be more or less normal, and > shouldn't break the test. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8600) Add more permissive reconfiguration policies
Benno Evers created MESOS-8600: -- Summary: Add more permissive reconfiguration policies Key: MESOS-8600 URL: https://issues.apache.org/jira/browse/MESOS-8600 Project: Mesos Issue Type: Improvement Reporter: Benno Evers With Mesos 1.5, the `reconfiguration_policy` flag was added to allow reconfiguration of agents without necessarily draining all tasks. However, the current implementation only allows a limited set of changes; the `--reconfiguration_policy=all` setting laid out in the original design doc is not yet implemented. This ticket is intended to track progress on implementing it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8704) Removing `work_dir` can trigger assertion failure in the mesos containerizer
Benno Evers created MESOS-8704: -- Summary: Removing `work_dir` can trigger assertion failure in the mesos containerizer Key: MESOS-8704 URL: https://issues.apache.org/jira/browse/MESOS-8704 Project: Mesos Issue Type: Bug Reporter: Benno Evers This was reported to me by [~jeschkies], so I might be missing some details. Starting a Mesos agent with the flag `--containerizer=mesos,docker`, using Marathon to run a task group on this agent, then stopping the agent, removing the `work_dir` folder, and restarting the agent with the flag `--containerizer=mesos` leads to the following crash during recovery: {noformat} I0319 15:58:03.865108 121364480 containerizer.cpp:674] Recovering containerizer F0319 15:58:03.867717 121364480 containerizer.cpp:919] CHECK_SOME(container->directory): is NONE *** Check failure stack trace: ***{noformat} After a reboot, things seemed to be working fine again. Since we're reading container IDs from `runtime_dir` during recovery, and that wasn't cleaned between agent restarts, it seems like we're missing some validation for the case where the agent restarts from a half-dirty state. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8703) Mesos master can`t reconnect to zookeeper
[ https://issues.apache.org/jira/browse/MESOS-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16408442#comment-16408442 ] Benno Evers commented on MESOS-8703: The original zookeeper crash might well be caused by MESOS-8550. However, usually this should just result in a crash and subsequent restart of the master. Instead, the master seems to lock up during shutdown. The cause might be a similar issue as in MESOS-1477, although I couldn't see any suspicious changes to the related files for version 1.4.1. If this issue is somewhat reproducible, it would probably be helpful to include stack traces for all threads when the master becomes unresponsive. > Mesos master can`t reconnect to zookeeper > -- > > Key: MESOS-8703 > URL: https://issues.apache.org/jira/browse/MESOS-8703 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.4.1 >Reporter: Anton Malevich >Priority: Blocker > > Mesos master can`t reconnect to zookeeper after zookeeper hangs. > {noformat} > 2018-03-20 > 10:16:45,608:1(0x2ae675db6700):ZOO_ERROR@handle_socket_error_msg@1666: Socket > [:2181] zk retcode=-7, errno=110(Connection timed out): connection > to :2181 timed out (exceeded timeout by 3ms) > 2018-03-20 10:16:45,609:1(0x2ae675db6700):ZOO_INFO@check_events@1728: > initiated connection to server [:2181] > 2018-03-20 > 10:16:45,619:1(0x2ae675db6700):ZOO_ERROR@handle_socket_error_msg@1764: Socket > [:2181] zk retcode=-112, errno=116(Stale file handle): > sessionId=0x5623d0e483dd435 has expired. > I0320 10:16:45.62060418 group.cpp:511] ZooKeeper session expired > I0320 10:16:45.62080216 detector.cpp:152] Detected a new leader: None > I0320 10:16:45.62095716 master.cpp:2176] The newly elected leader is None > mesos-master: ../../3rdparty/stout/include/stout/option.hpp:112: T& > Option::get() & [with T = mesos::MasterInfo]: Assertion `isSome()' failed. 
> *** Aborted at 1521541005 (unix time) try "date -d @1521541005" if you are > using GNU date *** > PC: @ 0x2ae63d2b9428 (unknown) > *** SIGABRT (@0x1) received by PID 1 (TID 0x2ae648ffa700) from PID 1; stack > trace: *** > @ 0x2ae63d078390 (unknown) > @ 0x2ae63d2b9428 (unknown) > @ 0x2ae63d2bb02a (unknown) > @ 0x2ae63d2b1bd7 (unknown) > @ 0x2ae63d2b1c82 (unknown) > 2018-03-20 10:16:45,622:1(0x2ae649ffc700):ZOO_INFO@zookeeper_close@2543: > Freeing zookeeper resources for sessionId=0x5623d0e483dd435 > 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@726: Client > environment:zookeeper.version=zookeeper C client 3.4.8 > 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@730: Client > environment:host.name= > 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@737: Client > environment:os.name=Linux > 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@738: Client > environment:os.arch=4.8.15-1.el7.wg.x86_64 > 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@739: Client > environment:os.version=#1 SMP Mon Dec 26 14:34:45 UTC 2016 > 2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@log_env@747: Client > environment:user.name=(null) > 2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@log_env@755: Client > environment:user.home=/root > 2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@log_env@767: Client > environment:user.dir=/ > 2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@zookeeper_init@800: > Initiating client connection, host= sessionTimeout=1 > watcher=0x2ae63b3711e0 sessionId=0 sessionPasswd= > context=0x2ae6900036f8 flags=0 > @ 0x2ae63ad6b55b mesos::internal::master::Master::detected() > @ 0x2ae63b9e4cfc process::ProcessBase::visit() > 2018-03-20 10:16:45,634:1(0x2ae6765b7700):ZOO_INFO@check_events@1728: > initiated connection to server [:2181] > @ 0x2ae63b9fac84 process::ProcessManager::resume() > @ 0x2ae63b9fd5e6 > 
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > @ 0x2ae63c87ec80 (unknown) > @ 0x2ae63d06e6ba start_thread > @ 0x2ae63d38b3dd (unknown) > 2018-03-20 10:16:45,651:1(0x2ae6765b7700):ZOO_INFO@check_events@1775: session > establishment complete on server [:2181], > sessionId=0x1623f43348692c7, negotiated timeout=1 > I0320 10:16:45.65168415 group.cpp:341] Group process > (zookeeper-group(2)@:5050) connected to ZooKeeper > I0320 10:16:45.65173315 group.cpp:831] Syncing group operations: queue > size (joins, cancels, datas) = (0, 0, 0) > I0320 10:16:45.65174315 group.cpp:419] Trying to create path '/mesos' in > ZooKeeper > I0320 10:16:45.67673615 detector.cpp:152] Detected a new leader: > (id='704') > I0320 10:16:45.67684415 group.cpp:700] Trying to get > '/mesos/json.info
[jira] [Created] (MESOS-8721) Unnecessary cropping of agent id's in the web ui
Benno Evers created MESOS-8721: -- Summary: Unnecessary cropping of agent id's in the web ui Key: MESOS-8721 URL: https://issues.apache.org/jira/browse/MESOS-8721 Project: Mesos Issue Type: Bug Reporter: Benno Evers Attachments: cropped_ids.png As seen in the attached image (captured from Firefox 59 and Mesos 1.2.3), the agents page of the web ui appears to be cropping agent ids even if the column would have enough space to display the full name. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8722) Hard-coded timeout for authentication failures
Benno Evers created MESOS-8722: -- Summary: Hard-coded timeout for authentication failures Key: MESOS-8722 URL: https://issues.apache.org/jira/browse/MESOS-8722 Project: Mesos Issue Type: Bug Reporter: Benno Evers In the mesos agent there is a hard-coded 5 second timeout for any authentication attempt: {noformat} void Slave::authenticate() { [...] delay(Seconds(5), self(), &Self::authenticationTimeout, authenticating.get()); } {noformat} When the network is poor, this can lead to a situation where an agent is unable to authenticate for a long time, preventing it from re-joining the cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8724) G++ Warning about libc system macros `major` and `minor` prevents Mesos build
Benno Evers created MESOS-8724: -- Summary: G++ Warning about libc system macros `major` and `minor` prevents Mesos build Key: MESOS-8724 URL: https://issues.apache.org/jira/browse/MESOS-8724 Project: Mesos Issue Type: Bug Reporter: Benno Evers On linux systems, the header `<sys/sysmacros.h>` defines three macros called makedev(), major() and minor(). (See also http://man7.org/linux/man-pages/man3/makedev.3.html) Trying to compile Mesos using g++ 7.2.0 leads to the following warning (promoted to an error by `-Werror`): {noformat} ../include/csi/csi.pb.h:6042:13: error: In the GNU C Library, "minor" is defined by <sys/sysmacros.h>. For historical compatibility, it is currently defined by <sys/types.h> as well, but we plan to remove this soon. To use "minor", include <sys/sysmacros.h> directly. If you did not intend to use a system-defined macro "minor", you should undefine it after including <sys/types.h>. [-Werror] inline ::google::protobuf::uint32 Version::minor() const { {noformat} The root cause is that csi.proto defines the following protobuf message: {noformat} message Version { uint32 major = 1; // This field is REQUIRED. uint32 minor = 2; // This field is REQUIRED. uint32 patch = 3; // This field is REQUIRED. } {noformat} The generated C++ header `csi.pb.h` will contain, amongst others, the following function: {noformat} #include <string> // [6000 lines of code...] inline ::google::protobuf::uint32 Version::major() const { // @@protoc_insertion_point(field_get:csi.Version.major) return major_; } {noformat} And the recursive include structure of the header `<string>` leads to `stdlib.h` as follows: {noformat} . /usr/include/c++/7/string .. /usr/include/c++/7/bits/basic_string.h ... /usr/include/c++/7/ext/string_conversions.h .... /usr/include/c++/7/cstdlib ..... /usr/include/stdlib.h{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8724) G++ Warning about libc system macros `major` and `minor` prevents Mesos build
[ https://issues.apache.org/jira/browse/MESOS-8724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409916#comment-16409916 ] Benno Evers commented on MESOS-8724: One subtle thing to keep in mind: if we decide to "properly" fix it by getting protoc to add the correct #undef's for minor and major, we should take care to *not* backport the patch to older mesos versions, since that would remove the previously defined function `csi::Version::gnu_dev_major()`, causing ABI incompatibility for people upgrading libmesos.so. > G++ Warning about libc system macros `major` and `minor` prevents Mesos build > - > > Key: MESOS-8724 > URL: https://issues.apache.org/jira/browse/MESOS-8724 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Priority: Major > > On linux systems, the header `<sys/sysmacros.h>` defines three macros called > makedev(), major() and minor(). (See also > [http://man7.org/linux/man-pages/man3/makedev.3.html]) > Trying to compile Mesos using g++ 7.2.0 leads to the following warning: > {noformat} > ../include/csi/csi.pb.h:6042:13: error: In the GNU C Library, "minor" is > defined > by <sys/sysmacros.h>. For historical compatibility, it is > currently defined by <sys/types.h> as well, but we plan to > remove this soon. To use "minor", include <sys/sysmacros.h> > directly. If you did not intend to use a system-defined macro > "minor", you should undefine it after including <sys/types.h>. [-Werror] > inline ::google::protobuf::uint32 Version::minor() const { > {noformat} > The root cause is that csi.proto defines the following protobuf message: > {noformat} > message Version { > uint32 major = 1; // This field is REQUIRED. > uint32 minor = 2; // This field is REQUIRED. > uint32 patch = 3; // This field is REQUIRED. > } > {noformat} > The generated C++ in `csi.pb.h` headers will contain, amongst others, the > following function: > {noformat} > #include <string> > // [6000 lines of code...] 
> inline ::google::protobuf::uint32 Version::major() const { > // @@protoc_insertion_point(field_get:csi.Version.major) > return major_; > } > {noformat} > And the recursive include structure of the header `<string>` leads to > `stdlib.h` as follows: > {noformat} > . /usr/include/c++/7/string > .. /usr/include/c++/7/bits/basic_string.h > ... /usr/include/c++/7/ext/string_conversions.h > .... /usr/include/c++/7/cstdlib > ..... /usr/include/stdlib.h > ...... /usr/include/x86_64-linux-gnu/sys/types.h > ....... /usr/include/x86_64-linux-gnu/sys/sysmacros.h{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
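The `#undef` approach mentioned in the comment can be sketched as follows. This is a hand-written illustration, not protoc output: the `Version` struct is a hypothetical stand-in for the generated `csi::Version` class, and the guards assume only that `<sys/types.h>` may or may not define `major` and `minor` as macros, depending on the glibc version.

```cpp
#include <sys/types.h>  // On some glibc versions this defines `major`/`minor`.

#include <cassert>
#include <cstdint>

// Drop the system-defined macros so that `major` and `minor` can be used as
// ordinary identifiers below. The guards make this a no-op on systems where
// <sys/types.h> does not define them.
#ifdef major
#undef major
#endif
#ifdef minor
#undef minor
#endif

// Hypothetical stand-in for the generated `csi::Version` message class.
struct Version {
  std::uint32_t major_;
  std::uint32_t minor_;
  std::uint32_t patch_;

  std::uint32_t major() const { return major_; }
  std::uint32_t minor() const { return minor_; }
  std::uint32_t patch() const { return patch_; }
};
```

Without the `#undef` lines, an affected toolchain expands the accessor names through the macros, which is how the function ends up being called `gnu_dev_major()` in the compiled library and why adding the `#undef`s later changes the exported symbol names.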
[jira] [Created] (MESOS-8728) Don't print full usage for invocation errors
Benno Evers created MESOS-8728: -- Summary: Don't print full usage for invocation errors Key: MESOS-8728 URL: https://issues.apache.org/jira/browse/MESOS-8728 Project: Mesos Issue Type: Improvement Reporter: Benno Evers The current usage string for mesos-master comes in at 399 lines, and for mesos-agent at 685 lines. Printing such a wall of text will overflow most terminal windows, making it necessary to scroll up to see the actual error when invoking mesos with an incorrect command line. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8728) Don't print full usage for invocation errors
[ https://issues.apache.org/jira/browse/MESOS-8728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411655#comment-16411655 ] Benno Evers commented on MESOS-8728: https://reviews.apache.org/r/63733/ > Don't print full usage for invocation errors > > > Key: MESOS-8728 > URL: https://issues.apache.org/jira/browse/MESOS-8728 > Project: Mesos > Issue Type: Improvement >Reporter: Benno Evers >Priority: Major > > The current usage string for mesos-master comes in at 399 lines, and for > mesos-agent at 685 lines. > > Printing such a wall of text will overflow most terminal windows, making it > necessary to scroll up to see the actual error when invoking mesos with an > incorrect command line. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8711) SlaveTest.ChangeDomain is disabled.
[ https://issues.apache.org/jira/browse/MESOS-8711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411712#comment-16411712 ] Benno Evers commented on MESOS-8711: https://reviews.apache.org/r/66248/ > SlaveTest.ChangeDomain is disabled. > --- > > Key: MESOS-8711 > URL: https://issues.apache.org/jira/browse/MESOS-8711 > Project: Mesos > Issue Type: Bug > Components: test >Reporter: Alexander Rukletsov >Assignee: Benno Evers >Priority: Major > Labels: disabled-test, flaky-test > > This test has been disabled in > https://github.com/apache/mesos/commit/c0468b240842d4aaf04249cb0a58c59c43d1850d. > We should either fix or remove it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7616) Consider supporting changes to agent's domain without full drain.
[ https://issues.apache.org/jira/browse/MESOS-7616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415463#comment-16415463 ] Benno Evers commented on MESOS-7616: Bookkeeping note: I've assigned the same number of story points to this and the corresponding epic MESOS-1739, please correct if this isn't the correct accounting method @[~vinodkone]. > Consider supporting changes to agent's domain without full drain. > - > > Key: MESOS-7616 > URL: https://issues.apache.org/jira/browse/MESOS-7616 > Project: Mesos > Issue Type: Improvement >Reporter: Neil Conway >Assignee: Benno Evers >Priority: Major > Labels: mesosphere > Fix For: 1.5.0 > > > In the initial review chain, any change to an agent's domain requires a full > drain. This is simple and straightforward, but it makes it more difficult for > operators to opt-in to using fault domains. > We should consider allowing agents to transition from "no configured domain" > to "configured domain" without requiring an agent drain. This has some > complications, however: e.g., without an API for communicating changes in an > agent's configuration to frameworks, they might not realize that an agent's > domain has changed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-1466) Race between executor exited event and launch task can cause overcommit of resources
[ https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers reassigned MESOS-1466: -- Resolution: Fixed Assignee: Meng Zhu > Race between executor exited event and launch task can cause overcommit of > resources > > > Key: MESOS-1466 > URL: https://issues.apache.org/jira/browse/MESOS-1466 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Reporter: Vinod Kone >Assignee: Meng Zhu >Priority: Major > Labels: reliability, twitter > > The following sequence of events can cause an overcommit > --> Launch task is called for a task whose executor is already running > --> Executor's resources are not accounted for on the master > --> Executor exits and the event is enqueued behind launch tasks on the master > --> Master sends the task to the slave which needs to commit for resources > for task and the (new) executor. > --> Master processes the executor exited event and re-offers the executor's > resources causing an overcommit of resources. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-1466) Race between executor exited event and launch task can cause overcommit of resources
[ https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415612#comment-16415612 ] Benno Evers commented on MESOS-1466: If I understand the issue correctly, this race seems to have been eliminated as a side-effect of introducing the `launch_executor` flag in Mesos 1.5: When the master sends the `RunTaskMessage` to the agent, it thinks that the specified executor is still running on the agent, so it will set `launch_executor = false`: {noformat} // src/master/master.cpp:3841 bool Master::isLaunchExecutor( const ExecutorID& executorId, Framework* framework, Slave* slave) const { CHECK_NOTNULL(framework); CHECK_NOTNULL(slave); if (!slave->hasExecutor(framework->id(), executorId)) { CHECK(!framework->hasExecutor(slave->id, executorId)) << "Executor '" << executorId << "' known to the framework " << *framework << " but unknown to the agent " << *slave; return true; } return false; }{noformat} On the slave, when the executor doesn't exist anymore, the task is dropped with reason `REASON_EXECUTOR_TERMINATED`: {noformat} // src/slave/slave.cpp:2881 // Master does not want to launch executor. if (executor == nullptr) { // Master wants no new executor launched and there is none running on // the agent. This could happen if the task expects some previous // tasks to launch the executor. However, the earlier task got killed // or dropped hence did not launch the executor but the master doesn't // know about it yet because the `ExitedExecutorMessage` is still in // flight. In this case, we will drop the task. // // We report TASK_DROPPED to the framework because the task was // never launched. For non-partition-aware frameworks, we report // TASK_LOST for backward compatibility. 
mesos::TaskState taskState = TASK_DROPPED; if (!protobuf::frameworkHasCapability( frameworkInfo, FrameworkInfo::Capability::PARTITION_AWARE)) { taskState = TASK_LOST; } foreach (const TaskInfo& _task, tasks) { const StatusUpdate update = protobuf::createStatusUpdate( frameworkId, info.id(), _task.task_id(), taskState, TaskStatus::SOURCE_SLAVE, id::UUID::random(), "No executor is expected to launch and there is none running", TaskStatus::REASON_EXECUTOR_TERMINATED, executorId); statusUpdate(update, UPID()); } // We do not send `ExitedExecutorMessage` here because the expectation // is that there is already one on the fly to master. If the message // gets dropped, we will hopefully reconcile with the master later. return; }{noformat} > Race between executor exited event and launch task can cause overcommit of > resources > > > Key: MESOS-1466 > URL: https://issues.apache.org/jira/browse/MESOS-1466 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Reporter: Vinod Kone >Priority: Major > Labels: reliability, twitter > > The following sequence of events can cause an overcommit > --> Launch task is called for a task whose executor is already running > --> Executor's resources are not accounted for on the master > --> Executor exits and the event is enqueued behind launch tasks on the master > --> Master sends the task to the slave which needs to commit for resources > for task and the (new) executor. > --> Master processes the executor exited event and re-offers the executor's > resources causing an overcommit of resources. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8801) Add jemalloc as optional third-party memory allocator
Benno Evers created MESOS-8801: -- Summary: Add jemalloc as optional third-party memory allocator Key: MESOS-8801 URL: https://issues.apache.org/jira/browse/MESOS-8801 Project: Mesos Issue Type: Improvement Reporter: Benno Evers As seen in MESOS-7876, using jemalloc over the default memory allocator can have performance benefits. Additionally, this also supports the use case of MESOS-7944 by providing an out-of-the-box option to enable memory profiling. (which is also the ticket referenced in the mailing list discussion about this) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8801) Add jemalloc as optional third-party memory allocator
[ https://issues.apache.org/jira/browse/MESOS-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16442743#comment-16442743 ] Benno Evers commented on MESOS-8801: Review: https://reviews.apache.org/r/63366 > Add jemalloc as optional third-party memory allocator > - > > Key: MESOS-8801 > URL: https://issues.apache.org/jira/browse/MESOS-8801 > Project: Mesos > Issue Type: Improvement >Reporter: Benno Evers >Priority: Major > > As seen in MESOS-7876, using jemalloc over the default memory allocator can have > performance benefits. > > Additionally, this also supports the use case of MESOS-7944 by providing > an out-of-the-box option to enable memory profiling. (which is also the > ticket referenced in the mailing list discussion about this) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8834) In libprocess, internal::send and internal::_send call each other; when outgoing[socket] always has packets to send, the stack can be exhausted, causing a core dump
[ https://issues.apache.org/jira/browse/MESOS-8834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454219#comment-16454219 ] Benno Evers commented on MESOS-8834: While I can't really understand the text, judging from the send -> _send -> send -> ... -> coredump sequence this looks like it might be the same issue as MESOS-8594? > In libprocess, internal::send and internal::_send call each other; > when outgoing[socket] always has packets to send, the stack can be exhausted, causing a core dump > > > Key: MESOS-8834 > URL: https://issues.apache.org/jira/browse/MESOS-8834 > Project: Mesos > Issue Type: Bug > Components: libprocess >Affects Versions: 1.5.0 >Reporter: liwuqi >Priority: Blocker > Labels: core, libprocess, send > > If some process sends messages in a > while(true) loop, a large number of messages get buffered in outgoing[socket]. Because the actual sending is performed at the low level by internal::send and internal::_send, recursive calls occur: > _send -> send -> _send ->send -> ... ->_send -> send -> > This makes the call stack grow continuously until it is exhausted and a core dump occurs. > In my local tests, the core dump occurs when the stack depth reaches 40,000+ > To solve this problem, the low-level message sending mechanism needs to be changed > > Please look into this issue, thanks > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
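The failure mode described in MESOS-8834 can be illustrated with a minimal sketch. This is not the libprocess implementation: `writeToSocket`, `sendNext`, `afterSend` and `drainIteratively` are hypothetical names made up for this example, and the queue is a plain `std::deque` rather than the real `outgoing[socket]` structure. The first pair of functions mirrors the reported send -> _send -> send pattern, where each queued message adds stack frames; the loop-based variant drains the same queue at constant stack depth.

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical stand-in for writing one encoded message to a socket.
// It does nothing here; only the call pattern matters for the sketch.
inline void writeToSocket(const std::string& message) { (void) message; }

std::size_t sendNext(std::deque<std::string>& outgoing);

// Continuation run after one message was written: drop it from the
// queue and immediately call back into sendNext() for the next one.
std::size_t afterSend(std::deque<std::string>& outgoing)
{
  outgoing.pop_front();
  return 1 + sendNext(outgoing);
}

// Mutually recursive variant: unless the compiler happens to optimize
// the calls away, stack depth grows with the number of queued messages.
std::size_t sendNext(std::deque<std::string>& outgoing)
{
  if (outgoing.empty()) {
    return 0;
  }
  writeToSocket(outgoing.front());
  return afterSend(outgoing);
}

// Iterative variant: the same work in a loop, so stack depth stays
// constant regardless of how many messages are queued.
std::size_t drainIteratively(std::deque<std::string>& outgoing)
{
  std::size_t sent = 0;
  while (!outgoing.empty()) {
    writeToSocket(outgoing.front());
    outgoing.pop_front();
    ++sent;
  }
  return sent;
}
```

Both variants return the number of messages sent. Only the iterative one is safe to call on a queue with tens of thousands of entries, which is roughly the depth at which the reporter observed the crash.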
[jira] [Commented] (MESOS-8797) Check failed in the default executor while running `MesosContainerizer/DefaultExecutorTest.TaskUsesExecutor/0` test.
[ https://issues.apache.org/jira/browse/MESOS-8797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454390#comment-16454390 ] Benno Evers commented on MESOS-8797: https://reviews.apache.org/r/66815/ > Check failed in the default executor while running > `MesosContainerizer/DefaultExecutorTest.TaskUsesExecutor/0` test. > > > Key: MESOS-8797 > URL: https://issues.apache.org/jira/browse/MESOS-8797 > Project: Mesos > Issue Type: Bug > Components: executor > Environment: Centos 7 SSL (internal CI) > master-[a95d9b8|https://github.com/apache/mesos/commit/a95d9b8fb53ab8fbf4a7b6d762c9e0749b4c013a] > (17-Apr-2018 14:03:14) >Reporter: Andrei Budnik >Priority: Major > Labels: flaky, flaky-test > Attachments: DefaultExecutorTest.TaskUsesExecutor-badrun.txt > > > {code:java} > lt-mesos-default-executor: ../../3rdparty/stout/include/stout/option.hpp:119: > T& Option::get() & [with T = std::basic_string]: Assertion > `isSome()' failed. > *** Aborted at 1523976443 (unix time) try "date -d @1523976443" if you are > using GNU date *** > PC: @ 0x7efcfd11f1f7 __GI_raise > *** SIGABRT (@0x4d44) received by PID 19780 (TID 0x7efcf5adb700) from PID > 19780; stack trace: *** > @ 0x7efcfd9da5e0 (unknown) > @ 0x7efcfd11f1f7 __GI_raise > @ 0x7efcfd1208e8 __GI_abort > @ 0x7efcfd118266 __assert_fail_base > @ 0x7efcfd118312 __GI___assert_fail > @ 0x55a05fa269f7 mesos::internal::DefaultExecutor::waited() > @ 0x7efd002212d1 process::ProcessBase::consume() > @ 0x7efd0023a52a process::ProcessManager::resume() > @ 0x7efd0023dfa6 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > @ 0x7efd003f9470 execute_native_thread_routine > @ 0x7efcfd9d2e25 start_thread > @ 0x7efcfd1e234d __clone > {code} > Observed this failure in internal CI for test > {code:java} > MesosContainerizer/DefaultExecutorTest.TaskUsesExecutor/0{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8687) Check failure in `ProcessBase::_consume()`.
[ https://issues.apache.org/jira/browse/MESOS-8687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454395#comment-16454395 ] Benno Evers commented on MESOS-8687: Review for the test fix: https://reviews.apache.org/r/66799/ > Check failure in `ProcessBase::_consume()`. > --- > > Key: MESOS-8687 > URL: https://issues.apache.org/jira/browse/MESOS-8687 > Project: Mesos > Issue Type: Bug > Components: libprocess >Affects Versions: 1.6.0 > Environment: ec2 CentOS 7 with SSL >Reporter: Alexander Rukletsov >Assignee: Benno Evers >Priority: Major > Labels: flaky-test, reliability > Attachments: MasterAPITest.MasterFailover-with-CHECK.txt, > MasterFailover-badrun.txt > > > Observed a segfault in the {{MasterAPITest.MasterFailover}} test: > {noformat} > 10:59:04 I0319 10:59:04.312197 3274 master.cpp:649] Authorization enabled > 10:59:04 F0319 10:59:04.312772 3274 owned.hpp:110] Check failed: 'get()' > Must be non NULL > 10:59:04 *** Check failure stack trace: *** > 10:59:04 I0319 10:59:04.313470 3279 hierarchical.cpp:175] Initialized > hierarchical allocator process > 10:59:04 I0319 10:59:04.313500 3279 whitelist_watcher.cpp:77] No whitelist > given > 10:59:04 @ 0x7fe82d44e0cd google::LogMessage::Fail() > 10:59:04 @ 0x7fe82d44ff1d google::LogMessage::SendToLog() > 10:59:04 @ 0x7fe82d44dcb3 google::LogMessage::Flush() > 10:59:04 @ 0x7fe82d450919 google::LogMessageFatal::~LogMessageFatal() > 10:59:04 @ 0x7fe82d3cee16 google::CheckNotNull<>() > 10:59:04 @ 0x7fe82d3b4253 process::ProcessBase::_consume() > 10:59:04 @ 0x7fe82d3b4a66 > _ZNO6lambda12CallableOnceIFN7process6FutureINS1_4http8ResponseEEEvEE10CallableFnINS_8internal7PartialIZNS1_11ProcessBase7consumeEONS1_9HttpEventEEUlRKNS1_5OwnedINS3_7Request_JSG_clEv > 10:59:04 @ 0x7fe82c39c3ca > 
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureINS1_4http8ResponseclINS0_IFSE_vESE_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseISD_EESt14default_deleteISQ_EEOSI_S3_E_JST_SI_St12_PlaceholderILi1EEclEOS3_ > 10:59:04 @ 0x7fe82d39f2c1 process::ProcessBase::consume() > 10:59:04 @ 0x7fe82d3b84da process::ProcessManager::resume() > 10:59:04 @ 0x7fe82d3bbf56 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > 10:59:04 @ 0x7fe82d577870 execute_native_thread_routine > 10:59:04 @ 0x7fe82a761e25 start_thread > 10:59:04 @ 0x7fe82986334d __clone > {noformat} > Full test log is attached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8869) Re-think semantics of os::system()
Benno Evers created MESOS-8869: -- Summary: Re-think semantics of os::system() Key: MESOS-8869 URL: https://issues.apache.org/jira/browse/MESOS-8869 Project: Mesos Issue Type: Improvement Reporter: Benno Evers The current posix implementation of stout's os::system() has two deficiencies that make its use harder than necessary: * Contrary to its documentation, in the case of an exec failure we don't return None but rather an exit code of 127. * The status obtained from waitpid() is returned directly, without WEXITSTATUS() being applied. Together, these imply that code relying on some particular return value must apply WEXITSTATUS() itself (breaking the platform-independence afforded by os::system()), and it cannot check if the program returned a value of 127/-1 at all. Intuitively, it seems the function might be more useful by only returning 0 if the call exited successfully, or None if any kind of error happened. We could also think about an additional platform-specific function {code:java} os::posix::system() {code} that returns the raw return value of the executed function. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
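To make the WEXITSTATUS() point in MESOS-8869 concrete, here is a small sketch of decoding a raw wait status on POSIX. It deliberately uses the standard `std::system()` rather than stout's `os::system()`, and the helper name `exitCodeOf` is hypothetical, chosen only for this example; it is not a proposal for the actual stout API.

```cpp
#include <sys/wait.h>

#include <cstdlib>
#include <optional>
#include <string>

// Runs `command` through the shell and returns its exit code, or
// std::nullopt if no child could be created or the child did not
// exit normally (e.g. it was killed by a signal).
//
// NOTE: on POSIX, std::system() returns a raw wait status as produced
// by waitpid(), so WEXITSTATUS() has to be applied to get the code.
std::optional<int> exitCodeOf(const std::string& command)
{
  const int status = std::system(command.c_str());

  if (status == -1) {
    return std::nullopt; // Child process could not be created.
  }

  if (!WIFEXITED(status)) {
    return std::nullopt; // Terminated by a signal or otherwise abnormal.
  }

  // A result of 127 stays ambiguous: the shell reports it when the
  // command could not be found, but a program may also exit with 127.
  return WEXITSTATUS(status);
}
```

For example, `exitCodeOf("exit 3")` yields 3, whereas the undecoded status for the same command is typically 768 on Linux. The sketch also shows why the caller cannot tell an exec failure apart from a program that genuinely exits with 127.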
[jira] [Commented] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16492785#comment-16492785 ] Benno Evers commented on MESOS-7966: I tried to reproduce it running a custom Mesos 1.2 (compiled from de306b5786de3c221bae1457c6f2ccaeb38eef9f), modifying the provided call.py script by changing the hostnames and moving the timestamp into the future and then running it via {noformat} while :; do python call.py; done {noformat} for a few minutes, but could not trigger a master crash. Looking at the code, I don't see any obvious race. The `Master::updateUnavailability()` handler in the master dispatches deletions for all existing inverse offers to the allocator actor, removes the offers from its own internal data structures, and afterwards dispatches a deletion for the maintenance to the allocator actor. The assertion triggers because the allocator gets a request to update an inverse offer when the maintenance doesn't exist yet/anymore, but I haven't really found a code path that could lead to this. If you could update your filtered log to include the log lines generated by the following block in master.cpp, I think this would help to pin down the exact sequence of deletions/additions that triggers the crash: {noformat} if (unavailability.isSome()) { // TODO(jmlvanre): Add stream operator for unavailability. 
LOG(INFO) << "Updating unavailability of agent " << *slave << ", starting at " << Nanoseconds(unavailability.get().start().nanoseconds()); } else { LOG(INFO) << "Removing unavailability of agent " << *slave; } {noformat} > check for maintenance on agent causes fatal error > - > > Key: MESOS-7966 > URL: https://issues.apache.org/jira/browse/MESOS-7966 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.1.0 >Reporter: Rob Johnson >Assignee: Joseph Wu >Priority: Critical > Labels: mesosphere, reliability > > We interact with the maintenance API frequently to orchestrate gracefully > draining agents of tasks without impacting service availability. > Occasionally we seem to trigger a fatal error in Mesos when interacting with > the api. This happens relatively frequently, and impacts us when downstream > frameworks (marathon) react badly to leader elections. > Here is the log line that we see when the master dies: > {code} > F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: > slaves[slaveId].maintenance.isSome() > {code} > It's quite possible we're using the maintenance API in the wrong way. We're > happy to provide any other logs you need - please let me know what would be > useful for debugging. > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)