[jira] [Commented] (MESOS-9766) /__processes__ endpoint can hang.
[ https://issues.apache.org/jira/browse/MESOS-9766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846645#comment-16846645 ] Alexander Rukletsov commented on MESOS-9766: {noformat:title=1.9.0 only} commit a8c411d3f8d2895ff5e95c412ef2f3e94713520f Author: Alexander Rukletsov AuthorDate: Fri May 3 13:23:50 2019 +0200 Commit: Alexander Rukletsov CommitDate: Thu May 23 12:58:32 2019 +0200 Logged when `/__processes__` returns. Adds a log entry when a response generated by `/__processes__` is about to be returned to the client. Review: https://reviews.apache.org/r/70589 {noformat} > /__processes__ endpoint can hang. > - > > Key: MESOS-9766 > URL: https://issues.apache.org/jira/browse/MESOS-9766 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: foundations > Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.1, 1.9.0 > > > A user reported that the {{/\_\_processes\_\_}} endpoint occasionally hangs. > Stack traces provided by [~alexr] revealed that all the threads appeared to > be idle waiting for events. After investigating the code, the issue was found > to be possible when a process gets terminated after the > {{/\_\_processes\_\_}} route handler dispatches to it, thus dropping the > dispatch and abandoning the future. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
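The failure mode described in MESOS-9766 (a dispatch dropped because the target process already terminated, leaving the caller blocked on a future nobody will ever satisfy) can be sketched outside libprocess with a plain `concurrent.futures.Future`. All names here are illustrative, not Mesos or libprocess APIs.

```python
from concurrent.futures import Future, TimeoutError as FutureTimeout

def dispatch_to_terminated_process():
    # The "process" has already terminated, so the dispatch is dropped:
    # the returned future is abandoned and will never be completed.
    return Future()

def route_handler(timeout=None):
    # With timeout=None this blocks forever -- the observed endpoint hang.
    f = dispatch_to_terminated_process()
    try:
        return f.result(timeout=timeout)
    except FutureTimeout:
        return "hung: future was abandoned"
```

Calling `route_handler(timeout=0.1)` returns the sentinel string after the timeout fires; without a timeout the call never returns, which matches the idle-threads picture in the stack traces.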
[jira] [Commented] (MESOS-9791) Libprocess does not support server only SSL certificate verification.
[ https://issues.apache.org/jira/browse/MESOS-9791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844701#comment-16844701 ] Alexander Rukletsov commented on MESOS-9791: A prototype relaxing certificate verification: https://github.com/rukletsov/mesos/commits/alexr/ssl-server-cert > Libprocess does not support server only SSL certificate verification. > - > > Key: MESOS-9791 > URL: https://issues.apache.org/jira/browse/MESOS-9791 > Project: Mesos > Issue Type: Improvement > Components: libprocess >Reporter: Alexander Rukletsov >Priority: Major > Labels: foundations, mesosphere, security, ssl, tls > > Currently SSL certificate verification in Libprocess can be configured in the > [following > ways|https://github.com/apache/mesos/blob/eecb82c77117998af0c67a53c64e9b1e975acfa4/3rdparty/libprocess/src/openssl.cpp#L88-L97]: > (1) send certificate if in server mode, verify peer certificates *if present*; > (2) require valid peer certificates in *both* client and server modes. > It is currently impossible to configure a Libprocess instance to > simultaneously: > (3) require valid peer certificate in client mode and send certificate in > server mode. > Because Libprocess is often used by programs that act both as servers and > clients, implementing (3) is necessary to enable the so-called > webserver-browser model. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9791) Libprocess does not support server only SSL certificate verification.
Alexander Rukletsov created MESOS-9791: -- Summary: Libprocess does not support server only SSL certificate verification. Key: MESOS-9791 URL: https://issues.apache.org/jira/browse/MESOS-9791 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: Alexander Rukletsov Currently SSL certificate verification in Libprocess can be configured in the [following ways|https://github.com/apache/mesos/blob/eecb82c77117998af0c67a53c64e9b1e975acfa4/3rdparty/libprocess/src/openssl.cpp#L88-L97]: (1) send certificate if in server mode, verify peer certificates *if present*; (2) require valid peer certificates in *both* client and server modes. It is currently impossible to configure a Libprocess instance to simultaneously: (3) require valid peer certificate in client mode and send certificate in server mode. Because Libprocess is often used by programs that act both as servers and clients, implementing (3) is necessary to enable the so-called webserver-browser model. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
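The three verification modes above can be illustrated with Python's stdlib `ssl` module. This is a sketch of the concepts only, not Libprocess's actual OpenSSL configuration: the missing mode (3) is simply "require a valid peer certificate in client mode, present but do not demand certificates in server mode".

```python
import ssl

def server_only_verification_contexts():
    # Mode (3): the client requires a valid server certificate, while the
    # server presents its own certificate but does not demand (or verify)
    # certificates from connecting clients.
    client = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    client.verify_mode = ssl.CERT_REQUIRED   # require a valid peer cert

    server = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    server.verify_mode = ssl.CERT_NONE       # send our cert, verify nothing
    return client, server
```

Because a Libprocess instance acts as both client and server, the point of the ticket is that a single configuration must be able to express this asymmetric pair.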
[jira] [Created] (MESOS-9790) Libprocess does not use standard tooling for hostname validation.
Alexander Rukletsov created MESOS-9790: -- Summary: Libprocess does not use standard tooling for hostname validation. Key: MESOS-9790 URL: https://issues.apache.org/jira/browse/MESOS-9790 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: Alexander Rukletsov Libprocess currently uses [custom code|https://github.com/apache/mesos/blob/eecb82c77117998af0c67a53c64e9b1e975acfa4/3rdparty/libprocess/src/openssl.cpp#L755-L863] for hostname validation in its SSL certificate verification workflow. However, OpenSSL provides a function for this, [{{X509_check_host()}}|https://www.openssl.org/docs/manmaster/man3/X509_check_host.html]. For safety and reliability, we should enable an option to use {{X509_check_host()}} for hostname validation instead of our custom code, but preserve the custom code for backward compatibility. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
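For a sense of what hand-rolled hostname validation involves, and why delegating to {{X509_check_host()}} is attractive, here is a deliberately naive sketch of the usual matching rules: an exact case-insensitive match, or a single wildcard confined to the left-most label. This is illustrative only and is not the Libprocess code linked above.

```python
def naive_check_host(cert_name: str, hostname: str) -> bool:
    # Illustrative sketch: accept an exact (case-insensitive) match, or a
    # single '*' standing for the entire left-most label, e.g.
    # '*.example.com' matches 'api.example.com' but not 'a.b.example.com'.
    cert_labels = cert_name.lower().split(".")
    host_labels = hostname.lower().split(".")
    if len(cert_labels) != len(host_labels):
        return False
    head, *rest = cert_labels
    if head != "*" and head != host_labels[0]:
        return False
    return rest == host_labels[1:]
```

Even this toy version has subtle edge cases (IDNA, partial-label wildcards, IP addresses), which is exactly the argument for using the battle-tested OpenSSL routine.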
[jira] [Commented] (MESOS-9329) CMake build on Fedora 28 fails due to libevent error
[ https://issues.apache.org/jira/browse/MESOS-9329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839483#comment-16839483 ] Alexander Rukletsov commented on MESOS-9329: Indeed, the autotools build uses a newer version of libevent, [2.0.22|https://github.com/apache/mesos/blob/a9a2acabd03181865055b77cf81e7bb310b236d6/3rdparty/libevent-2.0.22-stable.tar.gz]. We can't easily use it in the cmake build because newer versions do not support cmake, see MESOS-3529. Bottom line is: a cmake build on Linux with ssl and libevent enabled is currently not supported. > CMake build on Fedora 28 fails due to libevent error > > > Key: MESOS-9329 > URL: https://issues.apache.org/jira/browse/MESOS-9329 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Priority: Major > > Trying to build Mesos using cmake with the options > {noformat} > cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_SSL=1 -DENABLE_LIBEVENT=1 > {noformat} > fails due to the following: > {noformat} > [ 1%] Building C object CMakeFiles/event_extra.dir/bufferevent_openssl.c.o > /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c: > In function ‘bio_bufferevent_new’: > /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:112:3: > error: dereferencing pointer to incomplete type ‘BIO’ {aka ‘struct bio_st’} > b->init = 0; >^~ > /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c: > At top level: > /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:234:1: > error: variable ‘methods_bufferevent’ has initializer but incomplete type > static BIO_METHOD methods_bufferevent = { > [...] 
> {noformat} > Since the autotools build does not have issues when enabling libevent and > ssl, it seems most likely that the `libevent-2.1.5-beta` version used by > default in the cmake build is somehow connected to the error message. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9766) /__processes__ endpoint can hang.
[ https://issues.apache.org/jira/browse/MESOS-9766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833733#comment-16833733 ] Alexander Rukletsov commented on MESOS-9766: Logging processing time: https://reviews.apache.org/r/70589/ > /__processes__ endpoint can hang. > - > > Key: MESOS-9766 > URL: https://issues.apache.org/jira/browse/MESOS-9766 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: foundations > > A user reported that the {{/\_\_processes\_\_}} endpoint occasionally hangs. > Stack traces provided by [~alexr] revealed that all the threads appeared to > be idle waiting for events. After investigating the code, the issue was found > to be possible when a process gets terminated after the > {{/\_\_processes\_\_}} route handler dispatches to it, thus dropping the > dispatch and abandoning the future. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9718) Compile failures with char8_t by MSVC under /std:c++latest(C++20) mode
[ https://issues.apache.org/jira/browse/MESOS-9718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830104#comment-16830104 ] Alexander Rukletsov commented on MESOS-9718: [~QuellaZhang], [~abudnik], the proposed patch basically reverts https://reviews.apache.org/r/58430/. I understand that the patch compiles on the newest version of the MSVC toolset, but does it compile on the older versions that are currently in use? To phrase it differently, why do the reasons for introducing https://reviews.apache.org/r/58430/ no longer apply? > Compile failures with char8_t by MSVC under /std:c++latest(C++20) mode > -- > > Key: MESOS-9718 > URL: https://issues.apache.org/jira/browse/MESOS-9718 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: QuellaZhang >Priority: Major > Labels: windows > Attachments: mesos.patch.txt > > > Hi All, > We've stumbled across some build failures in Mesos after implementing support > for char8_t under /std:c++latest in the development version of Visual > C++. Could you help look at this? Thanks in advance! Note that this issue was > only found when compiling with an unreleased VC toolset; the next release of MSVC > will have this behavior. > *Repro steps:* > git clone -c core.autocrlf=true [https://github.com/apache/mesos] > D:\mesos\src > open a VS 2017 x64 command prompt as admin and browse to D:\mesos > set _CL_=/std:c++latest > cd src > .\bootstrap.bat > cd .. 
> mkdir build_x64 && pushd build_x64 > cmake ..\src -G "Visual Studio 15 2017 Win64" > -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 > -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64 > *Failures:* > base64_tests.i > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2664: > 'std::string base64::encode_url_safe(const std::string &,bool)': cannot > convert argument 1 from 'const char8_t [12]' to 'const std::string &' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: Reason: cannot > convert from 'const char8_t [12]' to 'const std::string' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: No constructor > could take the source type, or constructor overload resolution was ambiguous > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2660: > 'testing::internal::EqHelper::Compare': function does not take 3 > arguments > > D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430): > note: see declaration of 'testing::internal::EqHelper::Compare' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2512: > 'testing::AssertionResult': no appropriate default constructor available > > D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256): > note: see declaration of 'testing::AssertionResult' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2664: > 'std::string base64::encode_url_safe(const std::string &,bool)': cannot > convert argument 1 from 'const char8_t [12]' to 'const std::string &' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: Reason: cannot > convert from 'const char8_t [12]' to 'const std::string' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: No constructor > could take the source type, or constructor overload resolution was ambiguous > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2660: > 
'testing::internal::EqHelper::Compare': function does not take 3 > arguments > > D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430): > note: see declaration of 'testing::internal::EqHelper::Compare' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2512: > 'testing::AssertionResult': no appropriate default constructor available > > D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256): > note: see declaration of 'testing::AssertionResult' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2664: > 'Try base64::decode_url_safe(const std::string &)': cannot > convert argument 1 from 'const char8_t [16]' to 'const std::string &' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): note: Reason: cannot > convert from 'const char8_t [16]' to 'const std::string' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): note: No constructor > could take the source type, or constructor overload resolution was ambiguous > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2672: > 'AssertSomeEq': no matching overloaded function found > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2780: >
[jira] [Commented] (MESOS-7935) CMake build should fail immediately for in-source builds
[ https://issues.apache.org/jira/browse/MESOS-7935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787804#comment-16787804 ] Alexander Rukletsov commented on MESOS-7935: [~csnate], could you please upload the diff? > CMake build should fail immediately for in-source builds > > > Key: MESOS-7935 > URL: https://issues.apache.org/jira/browse/MESOS-7935 > Project: Mesos > Issue Type: Improvement > Components: cmake > Environment: macOS 10.12 > GNU/Linux Debian Stretch >Reporter: Damien Gerard >Assignee: Nathan Jackson >Priority: Major > Labels: build > > In-source builds are neither recommended nor supported. It is simple enough > to add a check to fail the build immediately. > --- > An in-source build of the master branch was broken with: > {noformat} > cd /Users/damien.gerard/projects/acp/mesos/src && > /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ > -DBUILD_FLAGS=\"\" -DBUILD_JAVA_JVM_LIBRARY=\"\" -DHAS_AUTHENTICATION=1 > -DLIBDIR=\"/usr/local/libmesos\" -DPICOJSON_USE_INT64 > -DPKGDATADIR=\"/usr/local/share/mesos\" > -DPKGLIBEXECDIR=\"/usr/local/libexec/mesos\" -DUSE_CMAKE_BUILD_CONFIG > -DUSE_STATIC_LIB -DVERSION=\"1.4.0\" -D__STDC_FORMAT_MACROS > -Dmesos_1_4_0_EXPORTS -I/Users/damien.gerard/projects/acp/mesos/include > -I/Users/damien.gerard/projects/acp/mesos/include/mesos > -I/Users/damien.gerard/projects/acp/mesos/src -isystem > /Users/damien.gerard/projects/acp/mesos/3rdparty/protobuf-3.3.0/src/protobuf-3.3.0-lib/lib/include > -isystem /Users/damien.gerard/projects/acp/mesos/3rdparty/libprocess/include > -isystem /usr/local/opt/apr/libexec/include/apr-1 -isystem > /Users/damien.gerard/projects/acp/mesos/3rdparty/boost-1.53.0/src/boost-1.53.0 > -isystem > /Users/damien.gerard/projects/acp/mesos/3rdparty/elfio-3.2/src/elfio-3.2 > -isystem > /Users/damien.gerard/projects/acp/mesos/3rdparty/glog-0.3.3/src/glog-0.3.3-lib/lib/include > -isystem > 
/Users/damien.gerard/projects/acp/mesos/3rdparty/nvml-352.79/src/nvml-352.79 > -isystem > /Users/damien.gerard/projects/acp/mesos/3rdparty/picojson-1.3.0/src/picojson-1.3.0 > -isystem /usr/local/include/subversion-1 -isystem > /Users/damien.gerard/projects/acp/mesos/3rdparty/stout/include -isystem > /Users/damien.gerard/projects/acp/mesos/3rdparty/http_parser-2.6.2/src/http_parser-2.6.2 > -isystem > /Users/damien.gerard/projects/acp/mesos/3rdparty/concurrentqueue-1.0.0-beta/src/concurrentqueue-1.0.0-beta > -isystem > /Users/damien.gerard/projects/acp/mesos/3rdparty/libev-4.22/src/libev-4.22 > -isystem > /Users/damien.gerard/projects/acp/mesos/3rdparty/zookeeper-3.4.8/src/zookeeper-3.4.8/src/c/include > -isystem > /Users/damien.gerard/projects/acp/mesos/3rdparty/zookeeper-3.4.8/src/zookeeper-3.4.8/src/c/generated > -isystem > /Users/damien.gerard/projects/acp/mesos/3rdparty/leveldb-1.19/src/leveldb-1.19/include > -std=c++11 -fPIC -o > CMakeFiles/mesos-1.4.0.dir/slave/containerizer/mesos/provisioner/backends/copy.cpp.o > -c > /Users/damien.gerard/projects/acp/mesos/src/slave/containerizer/mesos/provisioner/backends/copy.cpp > /Users/damien.gerard/projects/acp/mesos/src/slave/containerizer/mesos/provisioner/appc/store.cpp:132:46: > error: no member named 'fetcher' in namespace 'mesos::uri'; did you mean > 'Fetcher'? > Try> uriFetcher = uri::fetcher::create(); > ~^~~ > Fetcher > /Users/damien.gerard/projects/acp/mesos/include/mesos/uri/fetcher.hpp:46:7: > note: 'Fetcher' declared here > class Fetcher > ^ > /Users/damien.gerard/projects/acp/mesos/src/slave/containerizer/mesos/provisioner/appc/store.cpp:132:55: > error: no member named 'create' in 'mesos::uri::Fetcher' > Try> uriFetcher = uri::fetcher::create(); > {noformat} > Both Linux & macOS, not tested elsewhere, on {{master}} and tag 1.4.0-rc3 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-6674) Add Python sources to the CMake build
[ https://issues.apache.org/jira/browse/MESOS-6674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-6674: -- Assignee: (was: Srinivas) > Add Python sources to the CMake build > - > > Key: MESOS-6674 > URL: https://issues.apache.org/jira/browse/MESOS-6674 > Project: Mesos > Issue Type: Task > Components: cmake >Reporter: Joseph Wu >Priority: Major > Labels: microsoft > > The Python portion of the build includes a scheduler and executor driver as > well as Mesos protobufs. Eventually, there will be a CLI component as > well. > See the automake sources for more details, i.e. > https://github.com/apache/mesos/blob/2a73d956af1cb0615d4e66de126ab554fdabb0b5/src/Makefile.am#L1726-L1752 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-2382) replace unsafe "find | xargs" with "find -exec"
[ https://issues.apache.org/jira/browse/MESOS-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-2382: -- Assignee: (was: Diana Arroyo) > replace unsafe "find | xargs" with "find -exec" > --- > > Key: MESOS-2382 > URL: https://issues.apache.org/jira/browse/MESOS-2382 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 0.20.1 >Reporter: Lukas Loesche >Priority: Major > Labels: easyfix, patch > > The problem exists in > 1194:src/Makefile.am > 47:src/tests/balloon_framework_test.sh > The current "find | xargs rm -rf" in the Makefile could potentially destroy > data if the mesos source was in a folder with a space in the name. E.g. if you > for some reason check out mesos to "/ mesos", the command in src/Makefile.am > would turn into rm -rf / > "find | xargs" should be NUL-delimited with "find -print0 | xargs -0" for > safer execution, or can simply be replaced with find's built-in option "find > -exec '{}' \+", which behaves similarly to xargs. > There was a second occurrence of this in a test script, though in that case > it would only rmdir empty folders, so it is less critical. > I submitted a PR here: https://github.com/apache/mesos/pull/36 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
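The hazard MESOS-2382 describes is plain whitespace tokenization: by default, xargs splits its input on any whitespace, so a path under "/ mesos" arrives as the two arguments "/" and "mesos/...". The following hypothetical helper (not part of any real tool) emulates the two splitting strategies to make the failure concrete:

```python
def xargs_split(stream: str, null_delimited: bool = False):
    # Emulate how xargs tokenizes its input into command arguments.
    if null_delimited:
        # find -print0 | xargs -0: NUL-delimited, whitespace-safe.
        return [t for t in stream.split("\0") if t]
    # Default xargs: split on any whitespace, including spaces in filenames.
    return stream.split()

# A checkout under "/ mesos" yields paths containing a space:
dangerous = xargs_split("/ mesos/build/foo.o\n")
safe = xargs_split("/ mesos/build/foo.o\0", null_delimited=True)
```

With the default splitting, `rm -rf` would receive a literal "/" as its first argument; with NUL delimiting the path survives intact, which is why the ticket recommends "find -print0 | xargs -0" or "find -exec '{}' \+".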
[jira] [Commented] (MESOS-2379) Disabled master authentication causes authentication retries in the scheduler.
[ https://issues.apache.org/jira/browse/MESOS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787793#comment-16787793 ] Alexander Rukletsov commented on MESOS-2379: B. seems to be implemented now: https://github.com/apache/mesos/blob/996862828ca9b7675e40b495fe24d95615bb832b/src/sched/sched.cpp#L487-L505 C. is questionable: for the scheduler library to understand how to recover from {{AuthenticationErrorMessage}}, we should augment {{AuthenticationErrorMessage}} with a hint about what kind of error has happened (we already do this in the error string), e.g. a {{Reason}} enum. On the other hand, we might not want to mask such errors, to make sure an operator is engaged: what if the intention was to enable authentication (and this is why the scheduler tries it), but the master was misconfigured? > Disabled master authentication causes authentication retries in the > scheduler. > --- > > Key: MESOS-2379 > URL: https://issues.apache.org/jira/browse/MESOS-2379 > Project: Mesos > Issue Type: Bug > Components: security >Reporter: Till Toenshoff >Priority: Major > Labels: authentication, tech-debt > > The CRAM-MD5 authenticator relies upon shared credentials. Not supplying such > credentials while starting up a master effectively disables any > authentication. > A framework (or slave) may still attempt to authenticate, which is answered > with an {{AuthenticationErrorMessage}} by the master. That in turn will cause the > authenticatee to fail its {{authenticate}} promise, which in turn will cause > the current framework driver implementation to infinitely (and unthrottled) > retry authentication. > See: https://github.com/apache/mesos/blob/master/src/sched/sched.cpp#L372 > {noformat} > if (reauthenticate || !future.isReady()) { > LOG(INFO) > << "Failed to authenticate with master " << master.get() << ": " > << (reauthenticate ? "master changed" : >(future.isFailed() ? 
future.failure() : "future discarded")); > authenticating = None(); > reauthenticate = false; > // TODO(vinod): Add a limit on number of retries. > dispatch(self(), ::authenticate); // Retry. > return; > } > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
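The TODO in the snippet above (add a limit on the number of retries) is typically addressed with a capped exponential backoff schedule rather than unthrottled immediate retries. A minimal sketch of such a schedule, with hypothetical names and parameters (nothing here is Mesos code):

```python
def backoff_schedule(max_retries: int = 8, base: float = 0.1, cap: float = 30.0):
    # Delay before each retry doubles, up to a cap; after max_retries the
    # caller should give up and surface the error (e.g. to an operator).
    return [min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]
```

In practice one would also add random jitter to each delay so that many schedulers restarting at once do not retry in lockstep.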
[jira] [Commented] (MESOS-3973) Failing 'make distcheck' on Mac OS X 10.10.5, also 10.11.
[ https://issues.apache.org/jira/browse/MESOS-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787789#comment-16787789 ] Alexander Rukletsov commented on MESOS-3973: [~chhsia0] The steps were: {noformat} git clone https://github.com/apache/mesos mesos-1.8.0 cd mesos-1.8.0 ./bootstrap mkdir build cd build ../configure make distcheck {noformat} However, saying {{make}} before {{make distcheck}} fixes this for me. > Failing 'make distcheck' on Mac OS X 10.10.5, also 10.11. > - > > Key: MESOS-3973 > URL: https://issues.apache.org/jira/browse/MESOS-3973 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 0.21.0, 0.21.2, 0.22.0, 0.23.0, 0.24.0, 0.25.0, 0.26.0 > Environment: Mac OS X 10.10.5, Clang 7.0.0. >Reporter: Bernd Mathiske >Priority: Major > Labels: build, build-failure, mesosphere > Attachments: dist_check.log > > > Non-root 'make distcheck. > {noformat} > ... > [--] Global test environment tear-down > [==] 826 tests from 113 test cases ran. (276624 ms total) > [ PASSED ] 826 tests. > YOU HAVE 6 DISABLED TESTS > Making install in . > make[3]: Nothing to be done for `install-exec-am'. > ../install-sh -c -d > '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/lib/pkgconfig' > /usr/bin/install -c -m 644 mesos.pc > '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/lib/pkgconfig' > Making install in 3rdparty > /Applications/Xcode.app/Contents/Developer/usr/bin/make install-recursive > Making install in libprocess > Making install in 3rdparty > /Applications/Xcode.app/Contents/Developer/usr/bin/make install-recursive > Making install in stout > Making install in . > make[9]: Nothing to be done for `install-exec-am'. > make[9]: Nothing to be done for `install-data-am'. > Making install in include > make[9]: Nothing to be done for `install-exec-am'. 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/install-sh -c -d > '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/include' > ../../../../../../3rdparty/libprocess/3rdparty/stout/install-sh -c -d > '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/include/stout' > /usr/bin/install -c -m 644 > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/abort.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/attributes.hpp > > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/base64.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/bits.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/bytes.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/cache.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/check.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/duration.hpp > > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/dynamiclibrary.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/error.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/exit.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/flags.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/foreach.hpp > > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/format.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/fs.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/gtest.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/gzip.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/hashmap.hpp > > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/hashset.hpp > > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/interval.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/ip.hpp > 
../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/json.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/lambda.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/linkedhashmap.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/list.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/mac.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/multihashmap.hpp > > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/multimap.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/net.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/none.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/nothing.hpp > >
[jira] [Assigned] (MESOS-2235) Better path handling when using system-wide installations of third party dependencies
[ https://issues.apache.org/jira/browse/MESOS-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-2235: -- Assignee: (was: Kapil Arya) > Better path handling when using system-wide installations of third party > dependencies > - > > Key: MESOS-2235 > URL: https://issues.apache.org/jira/browse/MESOS-2235 > Project: Mesos > Issue Type: Improvement > Components: build >Reporter: Kapil Arya >Priority: Minor > Labels: mesosphere > > Currently, if one wishes to use the system-wide installation of third party > dependencies such as protobuf, the following configure command line is used: > {code} > ../configure --with-protobuf=/usr > {code} > The configure scripts then adds "/usr/include" to include path and /usr/lib > to library path. However, on some 64-bit systems (e.g., OpenSuse), /usr/lib > points to the 32-bit libraries and thus the build system ends up printing a > bunch of warnings: > {code} > libtool: link: g++ -g1 -O0 -Wno-unused-local-typedefs -std=c++11 -o > .libs/mesos-slave slave/mesos_slave-main.o -L/usr/lib ./.libs/libmesos.so > -lprotobuf -lsasl2 -lsvn_delta-1 -lsvn_subr-1 -lapr-1 -lcurl -lz -lpthread > -lrt -lunwind > /usr/lib64/gcc/x86_64-suse-linux/4.8/../../../../x86_64-suse-linux/bin/ld: > skipping incompatible /usr/lib/libpthread.so when searching for -lpthread > /usr/lib64/gcc/x86_64-suse-linux/4.8/../../../../x86_64-suse-linux/bin/ld: > skipping incompatible /usr/lib/librt.so when searching for -lrt > /usr/lib64/gcc/x86_64-suse-linux/4.8/../../../../x86_64-suse-linux/bin/ld: > skipping incompatible /usr/lib/libm.so when searching for -lm > /usr/lib64/gcc/x86_64-suse-linux/4.8/../../../../x86_64-suse-linux/bin/ld: > skipping incompatible /usr/lib/libc.so when searching for -lc > {code} > Further, if someone uses system-wide installations, we can omit the path with > the configure flag and the system should be able to pick the correct flags. 
> E.g, the above example becomes: > {code} > ../configure --with-protobuf > {code} > Since, the correct system include and lib dirs are already in the standard > path, we don't need to specify that path. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9638) Mesos masters do not authenticate with agents.
Alexander Rukletsov created MESOS-9638: -- Summary: Mesos masters do not authenticate with agents. Key: MESOS-9638 URL: https://issues.apache.org/jira/browse/MESOS-9638 Project: Mesos Issue Type: Improvement Components: agent, master Reporter: Alexander Rukletsov Currently Mesos agents do not verify that the messages they receive are coming from the leading master and haven't been tampered with. In untrusted environments this can be a source of security issues. There are a couple of ways to fix this: 1) implement Master authentication on the transport or application level for each {{agent}}<->{{master}} connection (this might not be sufficient to distinguish a master from the leading master) 2) implement Master authentication on the transport level (for the connection to be encrypted) upon agent registration and pass a secret to the master for all subsequent, possibly separate and unencrypted, connections (the secret can be leaked on an unencrypted connection). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
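Option (2) above hinges on authenticating subsequent messages with the exchanged secret; the standard primitive for that is an HMAC. A minimal sketch using Python's stdlib, illustrative only and not any proposed Mesos wire format:

```python
import hashlib
import hmac

def sign(secret: bytes, message: bytes) -> str:
    # Tag the message so the receiver can detect tampering or forgery.
    return hmac.new(secret, message, hashlib.sha256).hexdigest()

def verify(secret: bytes, message: bytes, tag: str) -> bool:
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(sign(secret, message), tag)
```

As the description warns, this only helps if the secret itself was exchanged over an encrypted registration connection; a secret leaked on an unencrypted link lets anyone forge valid tags.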
[jira] [Commented] (MESOS-3973) Failing 'make distcheck' on Mac OS X 10.10.5, also 10.11.
[ https://issues.apache.org/jira/browse/MESOS-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16786742#comment-16786742 ] Alexander Rukletsov commented on MESOS-3973: As of today, {{make distcheck}} for {{1.8.0-dev}} on Mac OS 10.13.6 still fails, while {{make check}} works. However, looking at the log, the problem now seems to be GRPC support. {noformat} touch libev-4.22-build-stamp ../protobuf-3.5.0/src/protoc -I../../../3rdparty/libprocess/src/tests --cpp_out=. ../../../3rdparty/libprocess/src/tests/benchmarks.proto ../protobuf-3.5.0/src/protoc -I../../../3rdparty/libprocess/src/tests --grpc_out=. ../../../3rdparty/libprocess/src/tests/grpc_tests.proto \ --plugin=protoc-gen-grpc=../grpc-1.10.0/bins/opt/grpc_cpp_plugin ../protobuf-3.5.0/src/protoc -I../../../3rdparty/libprocess/src/tests --cpp_out=. ../../../3rdparty/libprocess/src/tests/grpc_tests.proto /Library/Developer/CommandLineTools/usr/bin/make distdir-am (cd include && /Library/Developer/CommandLineTools/usr/bin/make top_distdir=../../../mesos-1.8.0 distdir=../../../mesos-1.8.0/3rdparty/libprocess/include \ am__remove_distdir=: am__skip_length_check=: am__skip_mode_fix=: distdir) /Library/Developer/CommandLineTools/usr/bin/make distdir-am /Library/Developer/CommandLineTools/usr/bin/make \ top_distdir="../../mesos-1.8.0" distdir="../../mesos-1.8.0/3rdparty/libprocess" \ dist-hook cp -r ../../../3rdparty/libprocess/3rdparty ../../mesos-1.8.0/3rdparty/libprocess/ (cd src && /Library/Developer/CommandLineTools/usr/bin/make top_distdir=../mesos-1.8.0 distdir=../mesos-1.8.0/src \ am__remove_distdir=: am__skip_length_check=: am__skip_mode_fix=: distdir) make[3]: *** No rule to make target `../include/csi/csi.grpc.pb.cc', needed by `distdir'. Stop. make[2]: *** [distdir-am] Error 1 make[1]: *** [distdir] Error 2 make: *** [dist] Error 2 {noformat} [~chhsia0] any chance you have an idea why off the top of your head? > Failing 'make distcheck' on Mac OS X 10.10.5, also 10.11. 
> - > > Key: MESOS-3973 > URL: https://issues.apache.org/jira/browse/MESOS-3973 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 0.21.0, 0.21.2, 0.22.0, 0.23.0, 0.24.0, 0.25.0, 0.26.0 > Environment: Mac OS X 10.10.5, Clang 7.0.0. >Reporter: Bernd Mathiske >Priority: Major > Labels: build, build-failure, mesosphere > Attachments: dist_check.log > > > Non-root 'make distcheck. > {noformat} > ... > [--] Global test environment tear-down > [==] 826 tests from 113 test cases ran. (276624 ms total) > [ PASSED ] 826 tests. > YOU HAVE 6 DISABLED TESTS > Making install in . > make[3]: Nothing to be done for `install-exec-am'. > ../install-sh -c -d > '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/lib/pkgconfig' > /usr/bin/install -c -m 644 mesos.pc > '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/lib/pkgconfig' > Making install in 3rdparty > /Applications/Xcode.app/Contents/Developer/usr/bin/make install-recursive > Making install in libprocess > Making install in 3rdparty > /Applications/Xcode.app/Contents/Developer/usr/bin/make install-recursive > Making install in stout > Making install in . > make[9]: Nothing to be done for `install-exec-am'. > make[9]: Nothing to be done for `install-data-am'. > Making install in include > make[9]: Nothing to be done for `install-exec-am'. 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/install-sh -c -d > '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/include' > ../../../../../../3rdparty/libprocess/3rdparty/stout/install-sh -c -d > '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/include/stout' > /usr/bin/install -c -m 644 > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/abort.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/attributes.hpp > > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/base64.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/bits.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/bytes.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/cache.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/check.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/duration.hpp > > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/dynamiclibrary.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/error.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/exit.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/flags.hpp >
[jira] [Assigned] (MESOS-1776) --without-PACKAGE will set incorrect dependency prefix
[ https://issues.apache.org/jira/browse/MESOS-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-1776: -- Assignee: (was: Kamil Domański) > --without-PACKAGE will set incorrect dependency prefix > -- > > Key: MESOS-1776 > URL: https://issues.apache.org/jira/browse/MESOS-1776 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 0.20.0 >Reporter: Kamil Domański >Priority: Major > Labels: build > > When disabling a particular bundled dependency with *--without-PACKAGE*, the > build scripts of both Mesos and libprocess will set a corresponding variable > to "no". This is later treated as the prefix under which to search for the > package. > For example, with *--without-protobuf*, the script will search for *protoc* > under *no/bin* and will obviously fail. I would propose getting rid of these > prefixes entirely and instead searching in the default locations. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
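The prefix mix-up described in this report reduces to a few lines. Below is a hypothetical Python sketch, not the actual autoconf code: whatever value the `--with-PACKAGE`/`--without-PACKAGE` flag stores is reused verbatim as a search prefix, so `--without-protobuf` (which stores the literal string "no") sends the script looking under `no/bin`.

```python
import os

def protoc_path(with_protobuf: str) -> str:
    # The stored flag value is reused verbatim as an installation prefix,
    # which only makes sense for --with-protobuf=DIR, not --without-protobuf.
    return os.path.join(with_protobuf, "bin", "protoc")

print(protoc_path("/opt/protobuf"))  # a real prefix: /opt/protobuf/bin/protoc
print(protoc_path("no"))             # --without-protobuf stores "no": no/bin/protoc
```

The fix proposed in the report is to stop treating the flag value as a prefix at all and fall back to searching the default system locations.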
[jira] [Assigned] (MESOS-1602) Add checks for unbundled libev
[ https://issues.apache.org/jira/browse/MESOS-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-1602: -- Assignee: (was: Timothy St. Clair) > Add checks for unbundled libev > -- > > Key: MESOS-1602 > URL: https://issues.apache.org/jira/browse/MESOS-1602 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 0.20.0 >Reporter: Timothy St. Clair >Priority: Major > > Per review, break out a check to ensure libev has been compiled with > -DEV_CHILD_ENABLE=0. > Also update the checks for prefixed installations. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9636) Autotools improvements
Alexander Rukletsov created MESOS-9636: -- Summary: Autotools improvements Key: MESOS-9636 URL: https://issues.apache.org/jira/browse/MESOS-9636 Project: Mesos Issue Type: Epic Components: build Reporter: Alexander Rukletsov -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-3973) Failing 'make distcheck' on Mac OS X 10.10.5, also 10.11.
[ https://issues.apache.org/jira/browse/MESOS-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-3973: -- Assignee: (was: Gilbert Song) > Failing 'make distcheck' on Mac OS X 10.10.5, also 10.11. > - > > Key: MESOS-3973 > URL: https://issues.apache.org/jira/browse/MESOS-3973 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 0.21.0, 0.21.2, 0.22.0, 0.23.0, 0.24.0, 0.25.0, 0.26.0 > Environment: Mac OS X 10.10.5, Clang 7.0.0. >Reporter: Bernd Mathiske >Priority: Major > Labels: build, build-failure, mesosphere > Attachments: dist_check.log > > > Non-root 'make distcheck. > {noformat} > ... > [--] Global test environment tear-down > [==] 826 tests from 113 test cases ran. (276624 ms total) > [ PASSED ] 826 tests. > YOU HAVE 6 DISABLED TESTS > Making install in . > make[3]: Nothing to be done for `install-exec-am'. > ../install-sh -c -d > '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/lib/pkgconfig' > /usr/bin/install -c -m 644 mesos.pc > '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/lib/pkgconfig' > Making install in 3rdparty > /Applications/Xcode.app/Contents/Developer/usr/bin/make install-recursive > Making install in libprocess > Making install in 3rdparty > /Applications/Xcode.app/Contents/Developer/usr/bin/make install-recursive > Making install in stout > Making install in . > make[9]: Nothing to be done for `install-exec-am'. > make[9]: Nothing to be done for `install-data-am'. > Making install in include > make[9]: Nothing to be done for `install-exec-am'. 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/install-sh -c -d > '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/include' > ../../../../../../3rdparty/libprocess/3rdparty/stout/install-sh -c -d > '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/include/stout' > /usr/bin/install -c -m 644 > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/abort.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/attributes.hpp > > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/base64.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/bits.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/bytes.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/cache.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/check.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/duration.hpp > > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/dynamiclibrary.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/error.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/exit.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/flags.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/foreach.hpp > > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/format.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/fs.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/gtest.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/gzip.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/hashmap.hpp > > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/hashset.hpp > > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/interval.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/ip.hpp > 
../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/json.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/lambda.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/linkedhashmap.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/list.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/mac.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/multihashmap.hpp > > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/multimap.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/net.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/none.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/nothing.hpp > > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/numify.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/option.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/os.hpp > ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/path.hpp
[jira] [Assigned] (MESOS-2092) Make ACLs dynamic
[ https://issues.apache.org/jira/browse/MESOS-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-2092: -- Assignee: (was: Yongqiao Wang) > Make ACLs dynamic > - > > Key: MESOS-2092 > URL: https://issues.apache.org/jira/browse/MESOS-2092 > Project: Mesos > Issue Type: Task > Components: security >Reporter: Alexander Rukletsov >Priority: Major > Labels: mesosphere, newbie > > Master loads ACLs once during its launch and there is no way to update them > in a running master. Making them dynamic will allow for updating ACLs on the > fly, for example granting a new framework necessary rights. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-4036) Install instructions for CentOS 6.6 lead to errors running `perf`.
[ https://issues.apache.org/jira/browse/MESOS-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-4036: -- Assignee: Alexander Rukletsov > Install instructions for CentOS 6.6 lead to errors running `perf`. > -- > > Key: MESOS-4036 > URL: https://issues.apache.org/jira/browse/MESOS-4036 > Project: Mesos > Issue Type: Improvement > Environment: CentOS 6.6 >Reporter: Greg Mann >Assignee: Alexander Rukletsov >Priority: Minor > Labels: integration, mesosphere, newbie > > After using the current installation instructions in the getting started > documentation, {{perf}} will not run on CentOS 6.6 because the version of > elfutils included in devtoolset-2 is not compatible with the version of > {{perf}} installed by {{yum}}. Installing and using devtoolset-3, however > (http://linux.web.cern.ch/linux/scientific6/docs/softwarecollections.shtml) > fixes this issue. This could be resolved by updating the getting started > documentation to recommend installing devtoolset-3. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-5588) Improve error handling when parsing acls.
[ https://issues.apache.org/jira/browse/MESOS-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-5588: -- Assignee: (was: Jörg Schad) > Improve error handling when parsing acls. > - > > Key: MESOS-5588 > URL: https://issues.apache.org/jira/browse/MESOS-5588 > Project: Mesos > Issue Type: Improvement >Reporter: Jörg Schad >Priority: Major > Labels: mesosphere, security > > During parsing of the authorizer configuration, errors are ignored. This can > lead to undetected security issues. > Consider the following ACL with a typo (usr instead of user): > {code} >"view_frameworks": [ > { > "principals": { "type": "ANY" }, > "usr": { "type": "NONE" } > } > ] > {code} > When the master is started with these flags, it will interpret the ACL in the > following way, which grants any principal access to any framework: > {noformat} > view_frameworks { > principals { > type: ANY > } > } > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
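The failure mode in this report is lenient parsing: unknown keys are dropped instead of rejected, so a misspelled restriction silently disappears. A minimal Python sketch of that behavior (hypothetical, not the actual Mesos parser; the `known` key set is an assumption based on the report's wording):

```python
def parse_view_frameworks(acl: dict) -> dict:
    known = {"principals", "user"}  # assumed schema; the misspelled "usr" is not in it
    # Lenient parsing: unknown keys are silently ignored instead of raising
    # an error, so the "usr": NONE restriction vanishes from the result.
    return {k: v for k, v in acl.items() if k in known}

acl = {"principals": {"type": "ANY"}, "usr": {"type": "NONE"}}
parsed = parse_view_frameworks(acl)
print(parsed)  # only the ANY principals clause survives; the NONE restriction is gone
```

A stricter parser would reject the document outright when it sees a key outside the schema, turning the typo into a startup error instead of an unintentionally permissive ACL.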
[jira] [Commented] (MESOS-5027) Enable authenticated login in the webui
[ https://issues.apache.org/jira/browse/MESOS-5027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16786559#comment-16786559 ] Alexander Rukletsov commented on MESOS-5027: Apparently, nothing grew out of haosdent's attempt. Closing this as "won't do". > Enable authenticated login in the webui > --- > > Key: MESOS-5027 > URL: https://issues.apache.org/jira/browse/MESOS-5027 > Project: Mesos > Issue Type: Improvement > Components: master, security, webui >Reporter: Greg Mann >Assignee: haosdent >Priority: Major > Labels: mesosphere, security > Attachments: Screen Shot 2016-04-07 at 21.02.45.png > > > The webui hits a number of endpoints to get the data that it displays: > {{/state}}, {{/metrics/snapshot}}, {{/files/browse}}, {{/files/read}}, and > possibly others. Once authentication is enabled on these endpoints, we need to > add a login prompt to the webui so that users can provide credentials. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9579) ExecutorHttpApiTest.HeartbeatCalls is flaky.
[ https://issues.apache.org/jira/browse/MESOS-9579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785917#comment-16785917 ] Alexander Rukletsov commented on MESOS-9579: Another instance observed today on Ubuntu 14.04: {noformat} 20:42:56 [ RUN ] ExecutorHttpApiTest.HeartbeatCalls 20:42:56 I0305 20:42:56.060261 28896 executor.cpp:206] Version: 1.8.0 20:42:56 W0305 20:42:56.060288 28896 process.cpp:2829] Attempted to spawn already running process version@172.16.10.87:33003 20:42:56 I0305 20:42:56.060858 28899 executor.cpp:432] Connected with the agent 20:42:56 F0305 20:42:56.060952 28899 owned.hpp:112] Check failed: 'get()' Must be non NULL 20:42:56 *** Check failure stack trace: *** 20:42:56 @ 0x7fb09b359ead google::LogMessage::Fail() 20:42:56 @ 0x7fb09b35bcdd google::LogMessage::SendToLog() 20:42:56 @ 0x7fb09b359a9c google::LogMessage::Flush() 20:42:56 @ 0x7fb09b35c5d9 google::LogMessageFatal::~LogMessageFatal() 20:42:56 @ 0x7fb09d1d79fd google::CheckNotNull<>() 20:42:56 @ 0x7fb09d1be8c4 _ZNSt17_Function_handlerIFvvEZN5mesos8internal5tests39ExecutorHttpApiTest_HeartbeatCalls_Test8TestBodyEvEUlvE_E9_M_invokeERKSt9_Any_data 20:42:56 @ 0x7fb09a1441a0 process::AsyncExecutorProcess::execute<>() 20:42:56 @ 0x7fb09a153908 _ZN5cpp176invokeIZN7process8dispatchI7NothingNS1_20AsyncExecutorProcessERKSt8functionIFvvEES9_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSE_FSB_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseIS3_EESt14default_deleteISP_EEOS7_PNS1_11ProcessBaseEE_JSS_S7_SV_EEEDTclcl7forwardISB_Efp_Espcl7forwardIT0_Efp0_EEEOSB_DpOSX_ 20:42:56 @ 0x7fb09b2ac961 process::ProcessBase::consume() 20:42:56 @ 0x7fb09b2bfbcc process::ProcessManager::resume() 20:42:56 @ 0x7fb09b2c5596 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv 20:42:56 @ 0x7fb09753da60 (unknown) 20:42:56 @ 0x7fb096d5a184 start_thread 20:42:56 @ 0x7fb096a8703d (unknown) 20:42:56 timeout: the monitored command dumped core 20:42:56 The test binary has 
crashed OR the timeout has been exceeded! {noformat} > ExecutorHttpApiTest.HeartbeatCalls is flaky. > > > Key: MESOS-9579 > URL: https://issues.apache.org/jira/browse/MESOS-9579 > Project: Mesos > Issue Type: Bug > Components: executor >Affects Versions: 1.8.0 > Environment: Centos 6 >Reporter: Till Toenshoff >Priority: Major > Labels: flaky, flaky-test > > I just saw this failing on our internal CI: > {noformat} > 21:42:35 [ RUN ] ExecutorHttpApiTest.HeartbeatCalls > 21:42:35 I0215 21:42:35.917752 17173 executor.cpp:206] Version: 1.8.0 > 21:42:35 W0215 21:42:35.917771 17173 process.cpp:2829] Attempted to spawn > already running process version@172.16.10.166:35439 > 21:42:35 I0215 21:42:35.918581 17174 executor.cpp:432] Connected with the > agent > 21:42:35 F0215 21:42:35.918857 17174 owned.hpp:112] Check failed: 'get()' > Must be non NULL > 21:42:35 *** Check failure stack trace: *** > 21:42:35 @ 0x7fb93ce1d1dd google::LogMessage::Fail() > 21:42:35 @ 0x7fb93ce1ee7d google::LogMessage::SendToLog() > 21:42:35 @ 0x7fb93ce1cdb3 google::LogMessage::Flush() > 21:42:35 @ 0x7fb93ce1f879 google::LogMessageFatal::~LogMessageFatal() > 21:42:35 @ 0x55e80a099f76 google::CheckNotNull<>() > 21:42:35 @ 0x55e80a07dde4 > _ZNSt17_Function_handlerIFvvEZN5mesos8internal5tests39ExecutorHttpApiTest_HeartbeatCalls_Test8TestBodyEvEUlvE_E9_M_invokeERKSt9_Any_data > 21:42:35 @ 0x7fb93baea260 process::AsyncExecutorProcess::execute<>() > 21:42:35 @ 0x7fb93baf62cb > _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchI7NothingNS1_20AsyncExecutorProcessERKSt8functionIFvvEESG_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSL_FSI_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseISA_EESt14default_deleteISW_EEOSE_S3_E_JSZ_SE_St12_PlaceholderILi1EEclEOS3_ > 21:42:36 @ 0x7fb93cd646b1 process::ProcessBase::consume() > 21:42:36 @ 0x7fb93cd794ba process::ProcessManager::resume() > 21:42:36 @ 0x7fb93cd7d486 > 
_ZNSt6thread11_State_implISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > 21:42:36 @ 0x7fb93d02a1af execute_native_thread_routine > 21:42:36 @ 0x7fb939794aa1 start_thread > 21:42:36 @ 0x7fb938b39c4d clone > 21:42:36 The test binary has crashed OR the timeout has been exceeded! > 21:42:36 ~/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mesos-ec2-centos-6 > 21:42:36 mkswap: /tmp/swapfile: warning: don't erase bootbits sectors > 21:42:36 on whole disk. Use -f to force. > 21:42:36 Setting up swapspace version 1, size = 8388604 KiB > 21:42:36 no label, UUID=dda5aa26-dba6-4ac8-bc6c-41264f510694 > 21:42:36 gcc (GCC) 6.3.1 20170216 (Red Hat 6.3.1-3) > 21:42:36 Copyright (C)
[jira] [Commented] (MESOS-9322) Executor exited accidentally, but mesos-agent did not report TASK_FAILED event.
[ https://issues.apache.org/jira/browse/MESOS-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16768514#comment-16768514 ] Alexander Rukletsov commented on MESOS-9322: [~guoshiwei] I agree and think it is a bug, too. We recently have at least two bugs related to "zombie executors": * MESOS-9502, stuck IOSwitchboard * MESOS-8125, MESOS-9501 pid reusal I would like to ask you to provide us with executor and agent logs, so we can determine whether you see one of the aforementioned issues or this is a separate bug. > Executor exited accidentally, but mesos-agent did not report TASK_FAILED > event. > --- > > Key: MESOS-9322 > URL: https://issues.apache.org/jira/browse/MESOS-9322 > Project: Mesos > Issue Type: Bug > Components: agent >Affects Versions: 1.4.1 > Environment: Linux n14-068-081 4.4.0-33.bm.1-amd64 #1 SMP Fri, 01 Sep > 2017 18:36:21 +0800 x86_64 GNU/Linux > OS: debion 8.10 > mesos version: 1.4.1 >Reporter: Shiwei Guo >Priority: Major > > The log about this executor: > executorid: > 'gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz' > > {noformat} > I0914 10:40:36.448287 2505 slave.cpp:7336] Recovering executor > 'gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz' > of framework ae7c9e78-e0b7-4110-8092-52baf64e4f67- > I0914 10:40:36.479209 2511 gc.cpp:58] Scheduling > '/opt/tiger/mesos_deploy/mesos_titan/slave/slaves/03def54c-f3f0-4ea5-a886-93fae5e570fa-S3473/frameworks/ae7c9e78-e0b7-4110-8092-52baf64e4f67-/executors/gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz/runs/189e4b23-c892-4c87-9069-dfc98ca5edc8' > for gc 3.1546935280563days in the future > I0914 10:40:36.479287 
2511 gc.cpp:58] Scheduling > '/opt/tiger/mesos_deploy/mesos_titan/slave/meta/slaves/03def54c-f3f0-4ea5-a886-93fae5e570fa-S3473/frameworks/ae7c9e78-e0b7-4110-8092-52baf64e4f67-/executors/gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz/runs/189e4b23-c892-4c87-9069-dfc98ca5edc8' > for gc 3.15469352761481days in the future > I0914 10:40:36.479310 2511 gc.cpp:58] Scheduling > '/opt/tiger/mesos_deploy/mesos_titan/slave/slaves/03def54c-f3f0-4ea5-a886-93fae5e570fa-S3473/frameworks/ae7c9e78-e0b7-4110-8092-52baf64e4f67-/executors/gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz/runs/4b27d1d4-fe67-4475-88bc-14e994acfb85' > for gc -1.02171850967407days in the future > I0914 10:40:36.479337 2511 gc.cpp:58] Scheduling > '/opt/tiger/mesos_deploy/mesos_titan/slave/meta/slaves/03def54c-f3f0-4ea5-a886-93fae5e570fa-S3473/frameworks/ae7c9e78-e0b7-4110-8092-52baf64e4f67-/executors/gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz/runs/4b27d1d4-fe67-4475-88bc-14e994acfb85' > for gc -1.02171850987259days in the future > I0914 10:40:36.480459 2514 gc.cpp:169] Deleting > /opt/tiger/mesos_deploy/mesos_titan/slave/slaves/03def54c-f3f0-4ea5-a886-93fae5e570fa-S3473/frameworks/ae7c9e78-e0b7-4110-8092-52baf64e4f67-/executors/gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz/runs/4b27d1d4-fe67-4475-88bc-14e994acfb85 > I0914 10:40:36.552492 2516 status_update_manager.cpp:211] Recovering executor > 
'gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz' > of framework ae7c9e78-e0b7-4110-8092-52baf64e4f67- > I0914 10:40:36.553234 2519 containerizer.cpp:665] Recovering container > 106c7257-fabb-4d58-8fcb-89b15bb9d404 for executor > 'gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz' > of framework ae7c9e78-e0b7-4110-8092-52baf64e4f67- > I0914 10:40:36.591421 2514 gc.cpp:177] Deleted >
[jira] [Created] (MESOS-9562) Authorization for DESTROY and UNRESERVE is not symmetrical.
Alexander Rukletsov created MESOS-9562: -- Summary: Authorization for DESTROY and UNRESERVE is not symmetrical. Key: MESOS-9562 URL: https://issues.apache.org/jira/browse/MESOS-9562 Project: Mesos Issue Type: Improvement Components: master, scheduler api Affects Versions: 1.7.1 Reporter: Alexander Rukletsov For [the {{UNRESERVE}} case|https://github.com/apache/mesos/blob/5d3ed364c6d1307d88e6b950ae0eef423c426673/src/master/master.cpp#L3661-L3677], if the principal was not set, {{.has_principal()}} will be {{false}}, hence we will not call {{authorizations.push_back()}}, and hence we will not create an authz request with this resource as an object. For [the {{DESTROY}} case|https://github.com/apache/mesos/blob/5d3ed364c6d1307d88e6b950ae0eef423c426673/src/master/master.cpp#L3772-L3773], if the principal was not set, a default value {{""}} for string will be used and hence we will create an authz request with this resource as an object. We definitely need to make the behaviour consistent. I'm not sure which approach is correct. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
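The asymmetry described above can be sketched in a few lines. This is a hypothetical Python reduction of the two code paths, not the actual Mesos master code; the function names and the `authz_requests` list are illustrative:

```python
from typing import List, Optional, Tuple

authz_requests: List[Tuple[str, str]] = []

def unreserve(principal: Optional[str], resource: str) -> None:
    # UNRESERVE path: guarded by has_principal(), so an unset principal
    # produces no authorization request at all for this resource.
    if principal is not None:
        authz_requests.append((principal, resource))

def destroy(principal: Optional[str], resource: str) -> None:
    # DESTROY path: the field is read directly, so an unset principal
    # falls back to the protobuf string default "" and a request is
    # still created with this resource as an object.
    authz_requests.append((principal if principal is not None else "", resource))

unreserve(None, "disk(role):128")  # no request created
destroy(None, "disk(role):128")    # request created with principal ""
print(authz_requests)
```

Making the behavior consistent means picking one of the two conventions, either always skipping the request when the principal is unset, or always issuing it with an explicit "unset" marker, and applying it to both operations.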
[jira] [Assigned] (MESOS-9143) MasterQuotaTest.RemoveSingleQuota is flaky.
[ https://issues.apache.org/jira/browse/MESOS-9143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-9143: -- Assignee: (was: Greg Mann) > MasterQuotaTest.RemoveSingleQuota is flaky. > --- > > Key: MESOS-9143 > URL: https://issues.apache.org/jira/browse/MESOS-9143 > Project: Mesos > Issue Type: Bug > Components: test >Reporter: Alexander Rukletsov >Priority: Major > Labels: flaky, flaky-test, mesosphere > Attachments: RemoveSingleQuota-badrun.txt > > > {noformat} > ../../src/tests/master_quota_tests.cpp:493 > Value of: metrics.at(metricKey).isNone() > Actual: false > Expected: true > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9533) CniIsolatorTest.ROOT_CleanupAfterReboot is flaky.
[ https://issues.apache.org/jira/browse/MESOS-9533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755853#comment-16755853 ] Alexander Rukletsov commented on MESOS-9533: I've reopened this because I have observed the same failure on the {{1.7.x}} branch. I've also set the fix versions to match those in MESOS-9518, since I suppose those are the branches into which the test has been reintroduced. > CniIsolatorTest.ROOT_CleanupAfterReboot is flaky. > - > > Key: MESOS-9533 > URL: https://issues.apache.org/jira/browse/MESOS-9533 > Project: Mesos > Issue Type: Bug > Components: cni, containerization >Affects Versions: 1.8.0 > Environment: centos-6 with SSL enabled >Reporter: Gilbert Song >Assignee: Gilbert Song >Priority: Major > Labels: cni, flaky-test > Fix For: 1.4.3, 1.5.3, 1.6.2, 1.7.2, 1.8.0 > > > {noformat} > Error Message > ../../src/tests/containerizer/cni_isolator_tests.cpp:2685 > Mock function called more times than expected - returning directly. > Function call: statusUpdate(0x7fffc7c05aa0, @0x7fe637918430 136-byte > object <80-24 29-45 E6-7F 00-00 00-00 00-00 00-00 00-00 3E-E8 00-00 00-00 > 00-00 00-B8 0E-20 F0-55 00-00 C0-03 07-18 E6-7F 00-00 20-17 05-18 E6-7F 00-00 > 10-50 05-18 E6-7F 00-00 50-D1 04-18 E6-7F 00-00 ... 00-00 00-00 00-00 00-00 > 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 > 00-00 00-00 00-00 F0-89 16-E9 58-2B D7-41 00-00 00-00 01-00 00-00 18-00 00-00 > 0B-00 00-00>) > Expected: to be called 3 times >Actual: called 4 times - over-saturated and active > Stacktrace > ../../src/tests/containerizer/cni_isolator_tests.cpp:2685 > Mock function called more times than expected - returning directly. > Function call: statusUpdate(0x7fffc7c05aa0, @0x7fe637918430 136-byte > object <80-24 29-45 E6-7F 00-00 00-00 00-00 00-00 00-00 3E-E8 00-00 00-00 > 00-00 00-B8 0E-20 F0-55 00-00 C0-03 07-18 E6-7F 00-00 20-17 05-18 E6-7F 00-00 > 10-50 05-18 E6-7F 00-00 50-D1 04-18 E6-7F 00-00 ... 
00-00 00-00 00-00 00-00 > 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 > 00-00 00-00 00-00 F0-89 16-E9 58-2B D7-41 00-00 00-00 01-00 00-00 18-00 00-00 > 0B-00 00-00>) > Expected: to be called 3 times >Actual: called 4 times - over-saturated and active > {noformat} > It was from this commit > https://github.com/apache/mesos/commit/c338f5ada0123c0558658c6452ac3402d9fbec29 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-3123) DockerContainerizerTest.ROOT_DOCKER_Launch_Executor_Bridged fails & crashes
[ https://issues.apache.org/jira/browse/MESOS-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-3123: -- Assignee: (was: Timothy Chen) > DockerContainerizerTest.ROOT_DOCKER_Launch_Executor_Bridged fails & crashes > --- > > Key: MESOS-3123 > URL: https://issues.apache.org/jira/browse/MESOS-3123 > Project: Mesos > Issue Type: Bug > Components: docker, test >Affects Versions: 0.23.0 > Environment: CentOS 7.1, CentOS 6.6, or Ubuntu 14.04 > Mesos 0.23.0-rc4 or today's master > Docker 1.9 >Reporter: Adam B >Priority: Major > Labels: disabled-test, mesosphere > Fix For: 0.26.0 > > > Fails the test and then crashes while trying to shutdown the slaves. > {code} > [ RUN ] DockerContainerizerTest.ROOT_DOCKER_Launch_Executor_Bridged > ../../src/tests/docker_containerizer_tests.cpp:618: Failure > Value of: statusRunning.get().state() > Actual: TASK_LOST > Expected: TASK_RUNNING > ../../src/tests/docker_containerizer_tests.cpp:619: Failure > Failed to wait 1mins for statusFinished > ../../src/tests/docker_containerizer_tests.cpp:610: Failure > Actual function call count doesn't match EXPECT_CALL(sched, > statusUpdate(, _))... 
> Expected: to be called twice >Actual: called once - unsatisfied and active > F0721 21:59:54.950773 30622 logging.cpp:57] RAW: Pure virtual method called > @ 0x7f3915347a02 google::LogMessage::Fail() > @ 0x7f391534cee4 google::RawLog__() > @ 0x7f3914890312 __cxa_pure_virtual > @ 0x88c3ae mesos::internal::tests::Cluster::Slaves::shutdown() > @ 0x88c176 mesos::internal::tests::Cluster::Slaves::~Slaves() > @ 0x88dc16 mesos::internal::tests::Cluster::~Cluster() > @ 0x88dc87 mesos::internal::tests::MesosTest::~MesosTest() > @ 0xa529ab > mesos::internal::tests::DockerContainerizerTest::~DockerContainerizerTest() > @ 0xa8125f > mesos::internal::tests::DockerContainerizerTest_ROOT_DOCKER_Launch_Executor_Bridged_Test::~DockerContainerizerTest_ROOT_DOCKER_Launch_Executor_Bridged_Test() > @ 0xa8128e > mesos::internal::tests::DockerContainerizerTest_ROOT_DOCKER_Launch_Executor_Bridged_Test::~DockerContainerizerTest_ROOT_DOCKER_Launch_Executor_Bridged_Test() > @ 0x1218b4e testing::Test::DeleteSelf_() > @ 0x1221909 > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0x121cb38 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0x1205713 testing::TestInfo::Run() > @ 0x1205c4e testing::TestCase::Run() > @ 0x120a9ca testing::internal::UnitTestImpl::RunAllTests() > @ 0x122277b > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0x121d81b > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0x120987a testing::UnitTest::Run() > @ 0xcfbf0c main > @ 0x7f391097caf5 __libc_start_main > @ 0x882089 (unknown) > make[3]: *** [check-local] Aborted (core dumped) > make[3]: Leaving directory `/home/me/mesos/build/src' > make[2]: *** [check-am] Error 2 > make[2]: Leaving directory `/home/me/mesos/build/src' > make[1]: *** [check] Error 2 > make[1]: Leaving directory `/home/me/mesos/build/src' > make: *** [check-recursive] Error 1 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-6780) ContentType/AgentAPIStreamingTest.AttachContainerInput test fails reliably
[ https://issues.apache.org/jira/browse/MESOS-6780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-6780: -- Assignee: (was: Kevin Klues) > ContentType/AgentAPIStreamingTest.AttachContainerInput test fails reliably > -- > > Key: MESOS-6780 > URL: https://issues.apache.org/jira/browse/MESOS-6780 > Project: Mesos > Issue Type: Bug > Environment: Mac OS 10.12, clang version 4.0.0 > (http://llvm.org/git/clang 88800602c0baafb8739cb838c2fa3f5fb6cc6968) > (http://llvm.org/git/llvm 25801f0f22e178343ee1eadfb4c6cc058628280e), > libc++-513447dbb91dd555ea08297dbee6a1ceb6abdc46 >Reporter: Benjamin Bannier >Priority: Major > Labels: disabled-test, flaky-test, mesosphere > Attachments: attach_container_input_no_ssl.log > > > The test {{ContentType/AgentAPIStreamingTest.AttachContainerInput}} (both > {{/0}} and {{/1}}) fail consistently for me in an SSL-enabled, optimized > build. > {code} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. 
> [--] 1 test from ContentType/AgentAPIStreamingTest > [ RUN ] ContentType/AgentAPIStreamingTest.AttachContainerInput/0 > I1212 17:11:12.371175 3971208128 cluster.cpp:160] Creating default 'local' > authorizer > I1212 17:11:12.393844 17362944 master.cpp:380] Master > c752777c-d947-4a86-b382-643463866472 (172.18.8.114) started on > 172.18.8.114:51059 > I1212 17:11:12.393899 17362944 master.cpp:382] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" > --credentials="/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" --http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" > --work_dir="/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/master" > --zk_session_timeout="10secs" > I1212 17:11:12.394670 17362944 master.cpp:432] Master only allowing > authenticated frameworks to register > I1212 17:11:12.394682 17362944 master.cpp:446] Master only allowing > authenticated agents to register > I1212 17:11:12.394691 17362944 
master.cpp:459] Master only allowing > authenticated HTTP frameworks to register > I1212 17:11:12.394701 17362944 credentials.hpp:37] Loading credentials for > authentication from > '/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/credentials' > I1212 17:11:12.394959 17362944 master.cpp:504] Using default 'crammd5' > authenticator > I1212 17:11:12.394996 17362944 authenticator.cpp:519] Initializing server SASL > I1212 17:11:12.411406 17362944 http.cpp:922] Using default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I1212 17:11:12.411571 17362944 http.cpp:922] Using default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I1212 17:11:12.411682 17362944 http.cpp:922] Using default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I1212 17:11:12.411775 17362944 master.cpp:584] Authorization enabled > I1212 17:11:12.413318 16289792 master.cpp:2045] Elected as the leading master! > I1212 17:11:12.413377 16289792 master.cpp:1568] Recovering from registrar > I1212 17:11:12.417582 14143488 registrar.cpp:362] Successfully fetched the > registry (0B) in 4.131072ms > I1212 17:11:12.417667 14143488 registrar.cpp:461] Applied 1 operations in > 27us; attempting to update the registry > I1212 17:11:12.421799 14143488 registrar.cpp:506] Successfully updated the > registry in 4.10496ms > I1212 17:11:12.421835 14143488 registrar.cpp:392] Successfully recovered > registrar > I1212 17:11:12.421998 17362944 master.cpp:1684] Recovered 0 agents from the > registry (136B); allowing 10mins for agents to re-register > I1212 17:11:12.422780 3971208128 containerizer.cpp:220] Using isolation: >
[jira] [Assigned] (MESOS-7023) IOSwitchboardTest.RecoverThenKillSwitchboardContainerDestroyed is flaky
[ https://issues.apache.org/jira/browse/MESOS-7023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-7023: -- Assignee: (was: Kevin Klues) > IOSwitchboardTest.RecoverThenKillSwitchboardContainerDestroyed is flaky > --- > > Key: MESOS-7023 > URL: https://issues.apache.org/jira/browse/MESOS-7023 > Project: Mesos > Issue Type: Bug > Components: agent, test >Affects Versions: 1.2.2 > Environment: ASF CI, cmake, gcc, Ubuntu 14.04, without libevent/SSL >Reporter: Greg Mann >Priority: Major > Labels: debugging, disabled-test, flaky > Attachments: IOSwitchboardTest. > RecoverThenKillSwitchboardContainerDestroyed.txt > > > This was observed on ASF CI: > {code} > /mesos/src/tests/containerizer/io_switchboard_tests.cpp:1052: Failure > Value of: statusFailed->reason() > Actual: 1 > Expected: TaskStatus::REASON_IO_SWITCHBOARD_EXITED > Which is: 27 > {code} > Find full log attached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8252) MasterAuthorizationTest.SlaveRemovedLost is flaky.
[ https://issues.apache.org/jira/browse/MESOS-8252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-8252: -- Assignee: (was: Alexander Rojas) > MasterAuthorizationTest.SlaveRemovedLost is flaky. > -- > > Key: MESOS-8252 > URL: https://issues.apache.org/jira/browse/MESOS-8252 > Project: Mesos > Issue Type: Bug > Components: test >Reporter: Alexander Rukletsov >Priority: Major > Labels: flaky-test > Attachments: SlaveRemovedLost-badrun.txt > > > Observed it in the internal CI today. Most likely related to the recent > introduction of {{Abandoned}} future state. Full log attached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9491) There exists no way to statically configure a weight for a Mesos role
[ https://issues.apache.org/jira/browse/MESOS-9491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729570#comment-16729570 ] Alexander Rukletsov commented on MESOS-9491: [~bbannier] Why do you think static configuration would be useful? We wanted to move away from the concept of statically defining roles in a cluster. > There exists no way to statically configure a weight for a Mesos role > - > > Key: MESOS-9491 > URL: https://issues.apache.org/jira/browse/MESOS-9491 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Benjamin Bannier >Priority: Major > > While it is possible to change the weight of any role at runtime over the > operator API, it seems we currently have no supported way to configure this > statically with configuration flags. Both the {{--weights}} and {{--roles}} > flags would in principle allow this, but are deprecated. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9499) Mesos supports only digest authentication scheme for Zookeeper.
[ https://issues.apache.org/jira/browse/MESOS-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-9499: -- Assignee: Dmitrii Kishchukov > Mesos supports only digest authentication scheme for Zookeeper. > --- > > Key: MESOS-9499 > URL: https://issues.apache.org/jira/browse/MESOS-9499 > Project: Mesos > Issue Type: Improvement >Affects Versions: 1.6.1, 1.7.0, 1.8.0 >Reporter: Alexander Rukletsov >Assignee: Dmitrii Kishchukov >Priority: Major > Labels: authentication, zookeeper > > Zookeeper has quite a flexible security model, of which Mesos supports digest > authentication only. This ticket aims to extend ZK authentication support in > Mesos. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9499) Mesos supports only digest authentication scheme for Zookeeper.
Alexander Rukletsov created MESOS-9499: -- Summary: Mesos supports only digest authentication scheme for Zookeeper. Key: MESOS-9499 URL: https://issues.apache.org/jira/browse/MESOS-9499 Project: Mesos Issue Type: Improvement Affects Versions: 1.7.0, 1.6.1, 1.8.0 Reporter: Alexander Rukletsov Zookeeper has quite a flexible security model, of which Mesos supports digest authentication only. This ticket aims to extend ZK authentication support in Mesos. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9419) Executor to framework message crashes master if framework has not re-registered.
[ https://issues.apache.org/jira/browse/MESOS-9419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16700415#comment-16700415 ] Alexander Rukletsov commented on MESOS-9419: I'd like to understand why the user has not observed the issue prior to {{1.5.x}}. [~chhsia0], when you say the issue "appears to be present there as well", does that mean you ran your test against {{1.0.x}}? > Executor to framework message crashes master if framework has not > re-registered. > > > Key: MESOS-9419 > URL: https://issues.apache.org/jira/browse/MESOS-9419 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.5.1, 1.6.1, 1.7.0 >Reporter: Benjamin Mahler >Assignee: Chun-Hung Hsiao >Priority: Blocker > > If the executor sends a framework message after a master failover, and the > framework has not yet re-registered with the master, this will crash the > master: > {code} > W20181105 22:02:48.782819 172709 master.hpp:2304] Master attempted to send > message to disconnected framework 03dc2603-acd6-491e-\ 8717-3f03e5ee37f4- > (Cook-1.24.0-9299b474217db499c9d28738050b359ac8dd55bb) > F20181105 22:02:48.782830 172709 master.hpp:2314] CHECK_SOME(pid): is NONE > *** Check failure stack trace: *** > *** @ 0x7f09e016b6cd google::LogMessage::Fail() > *** @ 0x7f09e016d38d google::LogMessage::SendToLog() > *** @ 0x7f09e016b2b3 google::LogMessage::Flush() > *** @ 0x7f09e016de09 google::LogMessageFatal::~LogMessageFatal() > *** @ 0x7f09df086228 _CheckFatal::~_CheckFatal() > *** @ 0x7f09df3a403d mesos::internal::master::Framework::send<>() > *** @ 0x7f09df2f4886 mesos::internal::master::Master::executorMessage() > *** @ 0x7f09df3b06a4 > _ZN15ProtobufProcessIN5mesos8internal6master6MasterEE8handlerNINS1_26ExecutorToFrameworkMessageEJRKNS0\ > > _7SlaveIDERKNS0_11FrameworkIDERKNS0_10ExecutorIDERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcJS9_SC_SF_SN_EEEvPS3_MS3\ > _FvRKN7process4UPIDEDpT1_ESS_SN_DpMT_KFT0_vE @ 0x7f09df345b43 > 
std::_Function_handler<>::_M_invoke() > *** @ 0x7f09df36930f ProtobufProcess<>::consume() > *** @ 0x7f09df2e0ff5 mesos::internal::master::Master::_consume() > *** @ 0x7f09df2f5542 mesos::internal::master::Master::consume() > *** @ 0x7f09e00d9c7a process::ProcessManager::resume() > *** @ 0x7f09e00dd836 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > *** @ 0x7f09dd467ac8 execute_native_thread_routine > *** @ 0x7f09dd6f6b50 start_thread > *** @ 0x7f09dcc7030d (unknown) > {code} > This is because Framework::send proceeds if the framework is disconnected. In > the case of a recovered framework, it will not have a pid or http connection > yet: > https://github.com/apache/mesos/blob/9b889a10927b13510a1d02e7328925dba3438a0b/src/master/master.hpp#L2590-L2610 > {code} > // Sends a message to the connected framework. > template <typename Message> > void Framework::send(const Message& message) > { > if (!connected()) { > LOG(WARNING) << "Master attempted to send message to disconnected" > << " framework " << *this; > // XXX proceeds! > } > metrics.incrementEvent(message); > if (http.isSome()) { > if (!http->send(message)) { > LOG(WARNING) << "Unable to send event to framework " << *this << ":" > << " connection closed"; > } > } else { > CHECK_SOME(pid); // XXX Will crash. > master->send(pid.get(), message); > } > } > {code} > The executor to framework path does not guard against the framework being > disconnected, unlike the status update path: > https://github.com/apache/mesos/blob/9b889a10927b13510a1d02e7328925dba3438a0b/src/master/master.cpp#L6472-L6495 > vs. > https://github.com/apache/mesos/blob/9b889a10927b13510a1d02e7328925dba3438a0b/src/master/master.cpp#L8371-L8373 > It was reported that this crash didn't occur for the user on 1.2.0, however > the issue appears to be present there as well, so we will try to backport a test > to see if it's indeed not occurring in 1.2.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
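The missing guard in {{Framework::send}} can be sketched as follows. This is a simplified model under stated assumptions: {{Framework}} and {{sendGuarded}} are hypothetical stand-ins for illustration, not Mesos' actual classes. The point is the early return on a disconnected framework, instead of falling through to {{CHECK_SOME(pid)}}.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Simplified model of the bug: a recovered framework is "disconnected" and
// has neither an HTTP connection nor a pid, but the original send() only
// logged a warning and then proceeded, crashing on CHECK_SOME(pid).
// These are illustrative types, not Mesos' real ones.
struct Framework {
  bool connected = false;
  std::optional<std::string> http;  // HTTP connection, if any.
  std::optional<std::string> pid;   // libprocess pid, if any.
};

// Returns true if the message can be sent; false if it must be dropped
// because the framework has not (re-)registered yet.
bool sendGuarded(const Framework& f, const std::string& /*message*/) {
  if (!f.connected) {
    return false;  // Guard: bail out instead of falling through.
  }
  if (f.http) {
    return true;   // Would send over the HTTP connection.
  }
  assert(f.pid);   // Safe now: a connected framework has http or pid.
  return true;     // Would send via the pid.
}
```

A disconnected (recovered) framework now causes the message to be dropped rather than a master crash.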
[jira] [Assigned] (MESOS-7991) fatal, check failed !framework->recovered()
[ https://issues.apache.org/jira/browse/MESOS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-7991: -- Assignee: (was: Alexander Rukletsov) > fatal, check failed !framework->recovered() > --- > > Key: MESOS-7991 > URL: https://issues.apache.org/jira/browse/MESOS-7991 > Project: Mesos > Issue Type: Bug >Reporter: Jack Crawford >Priority: Critical > Labels: reliability > > mesos master crashed on what appears to be framework recovery > mesos master version: 1.3.1 > mesos agent version: 1.3.1 > {code} > W0920 14:58:54.756364 25452 master.cpp:7568] Task > 862181ec-dffb-4c03-8807-5fb4c4e9a907 of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > W0920 14:58:54.756369 25452 master.cpp:7568] Task > 9c21c48a-63ad-4d58-9e22-f720af19a644 of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > W0920 14:58:54.756376 25452 master.cpp:7568] Task > 05c451f8-c48a-47bd-a235-0ceb9b3f8d0c of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > W0920 14:58:54.756381 25452 master.cpp:7568] Task > e8641b1f-f67f-42fe-821c-09e5a290fc60 of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > W0920 14:58:54.756386 25452 master.cpp:7568] Task > f838a03c-5cd4-47eb-8606-69b004d89808 of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > 
a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > W0920 14:58:54.756392 25452 master.cpp:7568] Task > 685ca5da-fa24-494d-a806-06e03bbf00bd of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > W0920 14:58:54.756397 25452 master.cpp:7568] Task > 65ccf39b-5c46-4121-9fdd-21570e8068e6 of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > F0920 14:58:54.756404 25452 master.cpp:7601] Check failed: > !framework->recovered() > *** Check failure stack trace: *** > @ 0x7f7bf80087ed google::LogMessage::Fail() > @ 0x7f7bf800a5a0 google::LogMessage::SendToLog() > @ 0x7f7bf80083d3 google::LogMessage::Flush() > @ 0x7f7bf800afc9 google::LogMessageFatal::~LogMessageFatal() > @ 0x7f7bf736fe7e > mesos::internal::master::Master::reconcileKnownSlave() > @ 0x7f7bf739e612 mesos::internal::master::Master::_reregisterSlave() > @ 0x7f7bf73a580e > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERK6OptionINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIc > RKSt6vectorINS5_8ResourceESaISQ_EERKSP_INS5_12ExecutorInfoESaISV_EERKSP_INS5_4TaskESaIS10_EERKSP_INS5_13FrameworkInfoESaIS15_EERKSP_INS6_17Archive_FrameworkESaIS1A_EERKSL_RKSP_INS5_20SlaveInfo_CapabilityESaIS > 1H_EERKNS0_6FutureIbEES9_SC_SM_SS_SX_S12_S17_S1C_SL_S1J_S1N_EEvRKNS0_3PIDIT_EEMS1R_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_T10_ET11_T12_T13_T14_T15_T16_T17_T18_T19_T20_T21_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_ > @ 0x7f7bf7f5e69c process::ProcessBase::visit() > @ 0x7f7bf7f71403 process::ProcessManager::resume() > @ 0x7f7bf7f7c127 > 
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv > @ 0x7f7bf60b5c80 (unknown) > @ 0x7f7bf58c86ba start_thread > @ 0x7f7bf55fe3dd (unknown) > mesos-master.service: Main process exited, code=killed, status=6/ABRT > mesos-master.service: Unit entered failed state. > mesos-master.service: Failed with result 'signal'. > {code} > The issue happened again on Mesos 1.5 (docker mesos master from the > mesosphere docker repo): > {code} > Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81543313 > http.cpp:1185] HTTP POST for /master/api/v1/scheduler from 10.142.0.5:55133 > Mar 11 10:04:33 research docker[4503]: I0311
[jira] [Commented] (MESOS-7991) fatal, check failed !framework->recovered()
[ https://issues.apache.org/jira/browse/MESOS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697136#comment-16697136 ] Alexander Rukletsov commented on MESOS-7991: An update from a user: "The failure in this case seems to happen right after an agent drops out of the cluster - which is a similar failure condition to the first time I encountered this". {noformat} Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81543313 http.cpp:1185] HTTP POST for /master/api/v1/scheduler from 10.142.0.5:55133 Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81558813 master.cpp:5467] Processing DECLINE call for offers: [ 5e57f633-a69c-4009-b773-990b4b8984ad-O58323 ] for framework 5e57f633-a69c-4009-b7 Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81569313 master.cpp:10703] Removing offer 5e57f633-a69c-4009-b773-990b4b8984ad-O58323 Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82014210 master.cpp:8227] Marking agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engi Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82036710 registrar.cpp:495] Applied 1 operations in 86528ns; attempting to update the registry Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82057210 registrar.cpp:552] Successfully updated the registry in 175872ns Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82064211 master.cpp:8275] Marked agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engin Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820957 9 hierarchical.cpp:609] Removed agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 Mar 11 10:04:35 research docker[4503]: F0311 10:04:35.85196111 master.cpp:10018] Check failed: 'framework' Must be non NULL Mar 11 10:04:35 research docker[4503]: *** Check failure stack trace: *** Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044a7d google::LogMessage::Fail() Mar 11 10:04:36 research 
docker[4503]: @ 0x7f96c6046830 google::LogMessage::SendToLog() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044663 google::LogMessage::Flush() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6047259 google::LogMessageFatal::~LogMessageFatal() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5258e14 google::CheckNotNull<>() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521dfc8 mesos::internal::master::Master::__removeSlave() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521f1a2 mesos::internal::master::Master::_markUnreachable() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5f98f11 process::ProcessBase::consume() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb2a4a process::ProcessManager::resume() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb65d6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv Mar 11 10:04:36 research docker[4503]: @ 0x7f96c35d4c80 (unknown) Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2de76ba start_thread Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2b1d41d (unknown) Mar 11 10:04:36 research docker[4503]: *** Aborted at 1520762676 (unix time) try "date -d @1520762676" if you are using GNU date *** Mar 11 10:04:36 research docker[4503]: PC: @ 0x7f96c2a4d196 (unknown) Mar 11 10:04:36 research docker[4503]: *** SIGSEGV (@0x0) received by PID 1 (TID 0x7f96b986d700) from PID 0; stack trace: *** Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2df1390 (unknown) Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2a4d196 (unknown) Mar 11 10:04:36 research docker[4503]: @ 0x7f96c604ce2c google::DumpStackTraceAndExit() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044a7d google::LogMessage::Fail() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6046830 google::LogMessage::SendToLog() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044663 google::LogMessage::Flush() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6047259 google::LogMessageFatal::~LogMessageFatal() Mar 11 10:04:36 
research docker[4503]: @ 0x7f96c5258e14 google::CheckNotNull<>() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521dfc8 mesos::internal::master::Master::__removeSlave() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521f1a2 mesos::internal::master::Master::_markUnreachable() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5f98f11 process::ProcessBase::consume() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb2a4a process::ProcessManager::resume() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb65d6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv Mar 11 10:04:36 research docker[4503]: @ 0x7f96c35d4c80 (unknown) Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2de76ba start_thread
[jira] [Assigned] (MESOS-7748) Slow subscribers of streaming APIs can lead to Mesos OOMing.
[ https://issues.apache.org/jira/browse/MESOS-7748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-7748: -- Assignee: (was: Alexander Rukletsov) > Slow subscribers of streaming APIs can lead to Mesos OOMing. > > > Key: MESOS-7748 > URL: https://issues.apache.org/jira/browse/MESOS-7748 > Project: Mesos > Issue Type: Bug >Reporter: Alexander Rukletsov >Priority: Critical > Labels: mesosphere, reliability > > For each active subscriber, Mesos master / slave maintains an event queue, > which grows over time if the subscriber does not read fast enough. As the > number of such "slow" subscribers grows, so does Mesos master / slave memory > consumption, which might lead to an OOM event. > Ideas to consider: > * Restrict the number of subscribers for the streaming APIs > * Check (ping) for inactive or "slow" subscribers > * Disconnect the subscriber when there are too many queued events in memory -- This message was sent by Atlassian JIRA (v7.6.3#76005)
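The third idea listed above (disconnecting a subscriber once too many events are queued) could look roughly like this. {{SubscriberQueue}} and its cap are hypothetical; this is a sketch of the policy, not Mesos' actual streaming machinery.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Sketch: cap the per-subscriber event queue and mark the subscriber for
// disconnection once the cap is exceeded, so a slow reader cannot grow
// master/agent memory without bound.
class SubscriberQueue {
public:
  explicit SubscriberQueue(size_t maxQueued) : maxQueued_(maxQueued) {}

  // Returns false once the subscriber has fallen too far behind and
  // should be disconnected instead of buffering more events.
  bool enqueue(const std::string& event) {
    if (disconnected_) return false;
    if (events_.size() >= maxQueued_) {
      disconnected_ = true;
      events_.clear();  // Reclaim the buffered events immediately.
      return false;
    }
    events_.push_back(event);
    return true;
  }

  bool disconnected() const { return disconnected_; }
  size_t queued() const { return events_.size(); }

private:
  const size_t maxQueued_;
  std::deque<std::string> events_;
  bool disconnected_ = false;
};
```

The cap value would have to balance tolerance for briefly slow readers against worst-case memory per subscriber.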
[jira] [Comment Edited] (MESOS-8975) Problem and solution overview for the slow API issue.
[ https://issues.apache.org/jira/browse/MESOS-8975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693568#comment-16693568 ] Alexander Rukletsov edited comment on MESOS-8975 at 11/20/18 5:41 PM: -- {noformat} commit 40dc508d59d547e867746bc6b5b82ced849687f8 Author: Alexander Rukletsov AuthorDate: Sun Nov 18 05:09:39 2018 +0100 Commit: Alexander Rukletsov CommitDate: Tue Nov 20 18:37:42 2018 +0100 Added MasterActorResponsiveness_BENCHMARK_Test. See summary. Review: https://reviews.apache.org/r/68131/ {noformat} was (Author: alexr): {noformat} Author: Alexander Rukletsov AuthorDate: Sun Nov 18 05:09:39 2018 +0100 Commit: Alexander Rukletsov CommitDate: Tue Nov 20 18:37:42 2018 +0100 Added MasterActorResponsiveness_BENCHMARK_Test. See summary. Review: https://reviews.apache.org/r/68131/ {noformat} > Problem and solution overview for the slow API issue. > - > > Key: MESOS-8975 > URL: https://issues.apache.org/jira/browse/MESOS-8975 > Project: Mesos > Issue Type: Task > Components: HTTP API >Reporter: Alexander Rukletsov >Assignee: Benno Evers >Priority: Major > Labels: benchmark, performance > Fix For: 1.8.0 > > > Collect data from the clusters regarding {{state.json}} responsiveness, > figure out, where the bottlenecks are, and prepare an overview of solutions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9395) Check failure on
Alexander Rukletsov created MESOS-9395: -- Summary: Check failure on Key: MESOS-9395 URL: https://issues.apache.org/jira/browse/MESOS-9395 Project: Mesos Issue Type: Bug Components: resource provider Affects Versions: 1.7.0 Reporter: Alexander Rukletsov Observed the following agent failure on one of our staging clusters: {noformat} Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re mesos-agent[26663]: I1116 11:57:24.641331 26684 http.cpp:1799] Processing GET_AGENT call Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re mesos-agent[26663]: I1116 11:57:24.650429 26679 http.cpp:1117] HTTP POST for /slave(1)/api/v1/resource_provider from 172.31.8.65:57790 Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re mesos-agent[26663]: I1116 11:57:24.650629 26679 manager.cpp:672] Subscribing resource provider {"attributes":[{"name":"lvm-vg-name","text":{"value":"lvm-double-1540383639"},"type":"SCALAR"},{"name":"dss-asset-id","text":{"value":"6AbZV6W2DrK4YgcIR3ICVo"},"type":"SCALAR"}],"default_reservations":[{"principal":"storage-principal","role":"dcos-storage","type":"DYNAMIC"}],"id":{"value":"8326e931-41f2-4f45-9174-13fe35c19300"},"name":"rp_6AbZV6W2DrK4YgcIR3ICVo","storage":{"plugin":{"containers":[{"command":{"environment":{"variables":[{"name":"PATH","type":"VALUE","value":"/opt/mesosphere/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"},{"name":"LD_LIBRARY_PATH","type":"VALUE","value":"/opt/mesosphere/lib"},{"name":"CONTAINER_LOGGER_DESTINATION_TYPE","type":"VALUE","value":"journald+logrotate"},{"name":"CONTAINER_LOGGER_EXTRA_LABELS","type":"VALUE","value":"{\"CSI_PLUGIN\":\"csilvm\"}"}]},"shell":true,"uris":[{"executable":true,"extract":false,"value":""}],"value":"echo \"a *:* rwm\" > /sys/fs/cgroup/devices`cat /proc/self/cgroup | grep devices | cut -d : -f 3`/devices.allow; exec ./csilvm -devices=/dev/xvdk,/dev/xvdj -volume-group=lvm-double-1540383639 -unix-addr-env=CSI_ENDPOINT 
-tag=6AbZV6W2DrK4YgcIR3ICVo"},"resources":[{"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"name":"mem","scalar":{"value":128.0},"type":"SCALAR"},{"name":"disk","scalar":{"value":10.0},"type":"SCALAR"}],"services":["CONTROLLER_SERVICE","NODE_SERVICE"]}],"name":"plugin_6AbZV6W2DrK4YgcIR3ICVo","type":"io.mesosphere.dcos.storage.csilvm"}},"type":"org.apache.mesos.rp.local.storage"} Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re mesos-agent[26663]: I1116 11:57:24.690474 26685 provider.cpp:546] Received SUBSCRIBED event Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re mesos-agent[26663]: I1116 11:57:24.690521 26685 provider.cpp:1492] Subscribed with ID 8326e931-41f2-4f45-9174-13fe35c19300 Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re mesos-agent[26663]: I1116 11:57:24.690657 26681 status_update_manager_process.hpp:314] Recovering operation status update manager Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re mesos-agent[26663]: F1116 11:57:24.691496 26682 provider.cpp:3121] Check failed: resource.disk().source().has_profile() != resource.disk().source().has_id() (1 vs. 
1) Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re mesos-agent[26663]: *** Check failure stack trace: *** Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re mesos-agent[26663]: @ 0x7fecb099e9fd google::LogMessage::Fail() Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re mesos-agent[26663]: @ 0x7fecb09a082d google::LogMessage::SendToLog() Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re mesos-agent[26663]: @ 0x7fecb099e5ec google::LogMessage::Flush() Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re mesos-agent[26663]: @ 0x7fecb09a1129 google::LogMessageFatal::~LogMessageFatal() Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re mesos-agent[26663]: @ 0x7fecb01654ca mesos::internal::StorageLocalResourceProviderProcess::applyCreateDisk() Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re mesos-agent[26663]: @ 0x7fecb017c683 mesos::internal::StorageLocalResourceProviderProcess::_applyOperation() Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re mesos-agent[26663]: @ 0x7fecb017d64a _ZZN5mesos8internal35StorageLocalResourceProviderProcess26reconcileOperationStatusesEvENKUlRKNS0_26StatusUpdateManagerProcessIN2id4UUIDENS0_27UpdateOperationStatusRecordENS0_28UpdateOperationStatusMessageEE5StateEE_clESA_ Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re mesos-agent[26663]: @ 0x7fecb017dd21
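The failed {{CHECK}} above encodes an exactly-one-of invariant: a disk source is expected to carry either a profile or an ID, never both and never neither. The {{(1 vs. 1)}} in the log means both were set here. A minimal illustration of that invariant (the helper name is made up):

```cpp
#include <cassert>

// Hypothetical helper mirroring the CHECK in provider.cpp: a disk source
// must have exactly one of {profile, id} set. The crash above fired
// because has_profile() and has_id() were both true ("(1 vs. 1)").
bool validDiskSource(bool hasProfile, bool hasId) {
  return hasProfile != hasId;  // XOR: exactly one must be present.
}
```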
[jira] [Commented] (MESOS-8723) ROOT_HealthCheckUsingPersistentVolume is flaky.
[ https://issues.apache.org/jira/browse/MESOS-8723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678453#comment-16678453 ] Alexander Rukletsov commented on MESOS-8723: This ^ bad run is likely https://jira.apache.org/jira/browse/MESOS-8096 > ROOT_HealthCheckUsingPersistentVolume is flaky. > --- > > Key: MESOS-8723 > URL: https://issues.apache.org/jira/browse/MESOS-8723 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.5.0 > Environment: ec2's CentOS 7 with SSL >Reporter: Alexander Rukletsov >Priority: Major > Labels: flaky-test > Attachments: ROOT_HealthCheckUsingPersistentVolume-badrun.txt > > > {noformat} > ../../src/tests/cluster.cpp:660: Failure > Failed to wait 15secs for destroy > I0321 19:45:11.676262 8064 master.cpp:1137] Master terminating > I0321 19:45:11.676625 27242 hierarchical.cpp:609] Removed agent > b7675b9a-d9e9-4c97-a5c2-d50fc6101301-S0 > {noformat} > Full log attached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8780) Expose Check and HealthCheck information on Mesos HTTP endpoints.
[ https://issues.apache.org/jira/browse/MESOS-8780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658741#comment-16658741 ] Alexander Rukletsov commented on MESOS-8780: Let's keep this one open: it's good to have checks and health checks as much in sync as possible. > Expose Check and HealthCheck information on Mesos HTTP endpoints. > - > > Key: MESOS-8780 > URL: https://issues.apache.org/jira/browse/MESOS-8780 > Project: Mesos > Issue Type: Improvement > Components: HTTP API >Reporter: Adam Medziński >Assignee: Greg Mann >Priority: Minor > Labels: api, integration, mesosphere > > Is the information about task health check definition not exposed on Mesos > HTTP endpoints ({{/master/tasks}} or {{/slave/state}} ) for some specific > reason? I'm working on integration with Hashicorp Consul and it would allow > me to synchronize the definitions of health checks only by using HTTP API. If > this information is not exposed by accident, I will gladly make a pull > request. > This is related to both {{HealthCheck}} and {{CheckInfo}} in both {{v0}} and > {{v1}} APIs. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-6417) Introduce an extra 'unknown' health check state.
[ https://issues.apache.org/jira/browse/MESOS-6417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-6417: -- Shepherd: Alexander Rukletsov Assignee: Greg Mann (was: Alexander Rukletsov) Sprint: Mesosphere RI-6 Sprint 2018-31 Story Points: 5 > Introduce an extra 'unknown' health check state. > > > Key: MESOS-6417 > URL: https://issues.apache.org/jira/browse/MESOS-6417 > Project: Mesos > Issue Type: Improvement >Reporter: Alexander Rukletsov >Assignee: Greg Mann >Priority: Major > Labels: health-check, mesosphere > > There are three logical states regarding health checks: > 1) no health checks; > 2) a health check is defined, but no result is available yet; > 3) a health check is defined, it is either healthy or not. > Currently, we do not distinguish between 1) and 2), which can be problematic > for framework authors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
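The three logical states above map naturally onto a tri-state result. A minimal sketch, with the caveat that the {{Health}} enum and {{classify}} helper are hypothetical and not Mesos' API, showing how states 1) and 2) become distinguishable:

```cpp
#include <cassert>
#include <optional>

// Tri-state health classification: no check defined, check defined but no
// result yet ("unknown"), and check defined with a boolean result.
enum class Health { NONE, UNKNOWN, HEALTHY, UNHEALTHY };

Health classify(bool checkDefined, std::optional<bool> lastResult) {
  if (!checkDefined) return Health::NONE;
  if (!lastResult) return Health::UNKNOWN;  // Distinguishes 1) from 2).
  return *lastResult ? Health::HEALTHY : Health::UNHEALTHY;
}
```

A framework author can then tell "task has no health check" apart from "health check configured, first result still pending".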
[jira] [Assigned] (MESOS-8780) Expose Check and HealthCheck information on Mesos HTTP endpoints.
[ https://issues.apache.org/jira/browse/MESOS-8780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-8780: -- Shepherd: Greg Mann Assignee: Alexander Rukletsov Sprint: Mesosphere RI-6 Sprint 2018-31 Story Points: 5 Labels: api integration mesosphere (was: ) Description: Is the information about task health check definition not exposed on Mesos HTTP endpoints ({{/master/tasks}} or {{/slave/state}} ) for some specific reason? I'm working on integration with Hashicorp Consul and it would allow me to synchronize the definitions of health checks only by using HTTP API. If this information is not exposed by accident, I will gladly make a pull request. This is related to both {{HealthCheck}} and {{CheckInfo}} in both {{v0}} and {{v1}} APIs. was:Is the information about task health check definition not exposed on Mesos HTTP endpoints ({{/master/tasks}} or {{/slave/state}} ) for some specific reason? I'm working on integration with Hashicorp Consul and it would allow me to synchronize the definitions of health checks only by using HTTP API. If this information is not exposed by accident, I will gladly make a pull request. Component/s: HTTP API Issue Type: Improvement (was: Story) Summary: Expose Check and HealthCheck information on Mesos HTTP endpoints. (was: Expose HealthCheck information on Mesos HTTP endpoints) > Expose Check and HealthCheck information on Mesos HTTP endpoints. > - > > Key: MESOS-8780 > URL: https://issues.apache.org/jira/browse/MESOS-8780 > Project: Mesos > Issue Type: Improvement > Components: HTTP API >Reporter: Adam Medziński >Assignee: Alexander Rukletsov >Priority: Minor > Labels: api, integration, mesosphere > > Is the information about task health check definition not exposed on Mesos > HTTP endpoints ({{/master/tasks}} or {{/slave/state}} ) for some specific > reason? 
I'm working on integration with Hashicorp Consul and it would allow > me to synchronize the definitions of health checks only by using HTTP API. If > this information is not exposed by accident, I will gladly make a pull > request. > This is related to both {{HealthCheck}} and {{CheckInfo}} in both {{v0}} and > {{v1}} APIs. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9317) Some master endpoints do not handle failed authorization properly.
Alexander Rukletsov created MESOS-9317: -- Summary: Some master endpoints do not handle failed authorization properly. Key: MESOS-9317 URL: https://issues.apache.org/jira/browse/MESOS-9317 Project: Mesos Issue Type: Bug Components: master Affects Versions: 1.7.0, 1.6.1, 1.5.1 Reporter: Alexander Rukletsov When we authorize _some_ actions (right now I see this happening to create / destroy volumes, reserve / unreserve resources) *and* {{authorizer}} fails (i.e. returns the future in a non-ready state), an assertion is triggered: {noformat} mesos-master[49173]: F1015 11:40:29.795748 49396 future.hpp:1306] Check failed: !isFailed() Future::get() but state == FAILED: Failed to retrieve permissions from IAM at url https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions the request failed: Failed to contact bouncer at https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to time out after 3 attempts {noformat} This is due to an incorrect assumption in our code, see for example [https://github.com/apache/mesos/blob/a063afce9868dcee38a0ab7efaa028244f3999cf/src/master/master.cpp#L3752-L3763]:
{noformat}
return await(authorizations)
  .then([](const vector<Future<bool>>& authorizations) -> Future<bool> {
    // Compute a disjunction.
    foreach (const Future<bool>& authorization, authorizations) {
      if (!authorization.get()) {
        return false;
      }
    }
    return true;
  });
{noformat}
Futures returned from {{await}} are guaranteed to be in a terminal state, but not necessarily ready! In the snippet above, {{!authorization.get()}} is invoked without checking the future's state first ⇒ the assertion fails.
Full stack trace: {noformat} Oct 15 11:40:39 int-master2-mwst9.scaletesting.mesosphe.re mesos-master[49173]: F1015 11:40:29.795748 49396 future.hpp:1306] Check failed: !isFailed() Future::get() but state == FAILED: Failed to retrieve permissions from IAM at url https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions the request failed: Failed to contact bouncer at https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to time out after 3 attemptsF1015 11:40:29.796037 49395 future.hpp:1306] Check failed: !isFailed() Future::get() but state == FAILED: Failed to retrieve permissions from IAM at url https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions the request failed: Failed to contact bouncer at https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to time out after 3 attemptsF1015 11:40:29.796097 49384 future.hpp:1306] Check failed: !isFailed() Future::get() but state == FAILED: Failed to retrieve permissions from IAM at url https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions the request failed: Failed to contact bouncer at https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to time out after 3 attemptsF1015 11:40:29.796249 49393 future.hpp:1306] Check failed: !isFailed() Future::get() but state == FAILED: Failed to retrieve permissions from IAM at url https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions the request failed: Failed to contact bouncer at https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to time out after 3 attemptsF1015 11:40:29.796375 49390 future.hpp:1306] Check failed: !isFailed() Future::get() but state == FAILED: Failed to retrieve permissions from IAM at url https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions the request failed: Failed to contact bouncer at https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to time out after 3 attemptsF1015 11:40:29.796483 49388 
future.hpp:1306] Check failed: !isFailed() Future::get() but state == FAILED: Failed to retrieve permissions from IAM at url https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions the request failed: Failed to contact bouncer at https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to time out after 3 attemptsF1015 11:40:29.796629 49381 future.hpp:1306] Check failed: !isFailed() Future::get() but state == FAILED: Failed to retrieve permissions from IAM at url https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions the request failed: Failed to contact bouncer at https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to time out after 3 attemptsF1015 11:40:29.796700 49385 future.hpp:1306] Check failed: !isFailed() Future::get() but state == FAILED: Failed to retrieve permissions from IAM at url https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions the request failed: Failed to contact bouncer at https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to time out after 3 attemptsF1015 11:40:29.796780 49386 future.hpp:1306] Check failed:
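The safe pattern is to check each awaited future's state before calling `get()`. A minimal self-contained sketch, using a toy `AuthFuture` stand-in (the names and the simplified synchronous shape are illustrative, not the real libprocess `Future` API):

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Toy stand-in for libprocess's Future<bool> (NOT the real API): after
// `await()` a future is guaranteed to be terminal -- READY, FAILED, or
// DISCARDED -- but only READY futures may be get()'d safely.
struct AuthFuture {
  enum class State { READY, FAILED, DISCARDED };

  State state;
  bool value = false;   // meaningful only when READY
  std::string error;    // meaningful only when FAILED

  bool isReady() const { return state == State::READY; }
  bool get() const { assert(isReady()); return value; }  // aborts otherwise
};

// Corrected all-must-pass check over awaited authorization futures:
// a non-ready future is surfaced as "no decision" (nullopt) instead of
// hitting the assertion inside get().
std::optional<bool> authorize(const std::vector<AuthFuture>& futures) {
  for (const AuthFuture& f : futures) {
    if (!f.isReady()) {
      return std::nullopt;  // authorizer failed; report an error upward
    }
    if (!f.get()) {
      return false;  // explicitly denied
    }
  }
  return true;  // all authorizations granted
}
```

In the real handlers the `std::nullopt` branch would translate into an error response (e.g. 500) rather than a crash.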
[jira] [Commented] (MESOS-9277) UNRESERVE scheduler call can be dropped if it loses the race with TEARDOWN.
[ https://issues.apache.org/jira/browse/MESOS-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649340#comment-16649340 ] Alexander Rukletsov commented on MESOS-9277: IIUC, the scheduler API is a stream, hence using {{process::Sequence}} should be sufficient. > UNRESERVE scheduler call can be dropped if it loses the race with TEARDOWN. > > > Key: MESOS-9277 > URL: https://issues.apache.org/jira/browse/MESOS-9277 > Project: Mesos > Issue Type: Bug > Components: scheduler api >Affects Versions: 1.5.1, 1.6.1, 1.7.0 >Reporter: Alexander Rukletsov >Priority: Major > Labels: mesosphere, v1_api > > A typical use pattern for a framework scheduler is to remove its reservations > before tearing itself down. However, it is racy: {{UNRESERVE}} is a > multi-stage action which aborts if the framework is removed in-between. > *Solution 1* > Let schedulers use operation feedback and expect them to wait for an ack for > {{UNRESERVE}} before they send {{TEARDOWN}}. Kind of science fiction with a > timeline of {{O(months)}} and still possibilities for the race if a scheduler > does not comply. > *Solution 2* > Serialize calls for schedulers. For example, we can chain [handlers > here|https://github.com/apache/mesos/blob/6e21e94ddca5b776d44636fe3eba8500bf88dc25/src/master/http.cpp#L640-L711] > onto a per-{{Master::Framework}} > [{{process::Sequence}}|https://github.com/apache/mesos/blob/6e21e94ddca5b776d44636fe3eba8500bf88dc25/3rdparty/libprocess/include/process/sequence.hpp]. > For that, however, handlers must provide futures indicating when the > processing of the call is finished; note that most [handlers > here|https://github.com/apache/mesos/blob/6e21e94ddca5b776d44636fe3eba8500bf88dc25/src/master/http.cpp#L640-L711] > return void. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-7693) DEBUG container does not inherit env variable properly for command tasks.
[ https://issues.apache.org/jira/browse/MESOS-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-7693: -- Assignee: (was: Alexander Rukletsov) > DEBUG container does not inherit env variable properly for command tasks. > - > > Key: MESOS-7693 > URL: https://issues.apache.org/jira/browse/MESOS-7693 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.3.0 >Reporter: Jie Yu >Priority: Major > > I can repo the issue: > {code} > sudo /home/vagrant/workspace/dist/mesos-1.4.0/bin/mesos-execute > --master=172.28.128.3:5050 --name=java8 --docker_image=java:8 > --command="sleep 1000" > I0618 17:42:21.410598 3356 scheduler.cpp:184] Version: 1.4.0 > I0618 17:42:21.413465 3356 scheduler.cpp:470] New master detected at > master@172.28.128.3:5050 > Subscribed with ID cacf5c08-cbbc-401a-a84d-2cfc4edc6519-0006 > Submitted task 'java8' to agent 'cacf5c08-cbbc-401a-a84d-2cfc4edc6519-S0' > Received status update TASK_RUNNING for task 'java8' > source: SOURCE_EXECUTOR > Jies-MacBook-Pro:script jie$ ./dcos task > NAME HOST USER STATE ID > java8 172.28.128.3 rootRjava8 > Jies-MacBook-Pro:script jie$ ./dcos task exec -t -i java8 bash > root@vagrant-ubuntu-trusty-64:/mnt/mesos/sandbox# env > LIBPROCESS_IP=172.28.128.3 > MESOS_AGENT_ENDPOINT=172.28.128.3:5051 > MESOS_DIRECTORY=/tmp/mesos/slave/slaves/cacf5c08-cbbc-401a-a84d-2cfc4edc6519-S0/frameworks/cacf5c08-cbbc-401a-a84d-2cfc4edc6519-0006/executors/java8/runs/1b06c661-20f3-460a-8cfd-475dc3e60aa3 > MESOS_EXECUTOR_ID=java8 > PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin > PWD=/mnt/mesos/sandbox > MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD=5secs > MESOS_NATIVE_JAVA_LIBRARY=/home/vagrant/workspace/dist/mesos-1.4.0/lib/libmesos-1.4.0.so > MESOS_NATIVE_LIBRARY=/home/vagrant/workspace/dist/mesos-1.4.0/lib/libmesos-1.4.0.so > MESOS_HTTP_COMMAND_EXECUTOR=0 > MESOS_SLAVE_PID=slave(1)@172.28.128.3:5051 > MESOS_FRAMEWORK_ID=cacf5c08-cbbc-401a-a84d-2cfc4edc6519-0006 > 
MESOS_CHECKPOINT=0 > SHLVL=1 > LIBPROCESS_PORT=0 > MESOS_SLAVE_ID=cacf5c08-cbbc-401a-a84d-2cfc4edc6519-S0 > MESOS_SANDBOX=/mnt/mesos/sandbox > _=/usr/bin/env > {code} > As you can see, environment variables like JAVA_HOME defined in the docker > image are not in the debug container. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8907) curl fetcher fails with HTTP/2
[ https://issues.apache.org/jira/browse/MESOS-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645579#comment-16645579 ] Alexander Rukletsov commented on MESOS-8907: [~tillt] the fix sounds reasonable to me, however, I'd like to confirm first, that the version of curl used in Ubuntu 18 started using HTTP/2 by default, which was the case for Ubuntu 16. > curl fetcher fails with HTTP/2 > -- > > Key: MESOS-8907 > URL: https://issues.apache.org/jira/browse/MESOS-8907 > Project: Mesos > Issue Type: Task > Components: fetcher >Reporter: James Peach >Priority: Major > Labels: integration > > {noformat} > [ RUN ] > ImageAlpine/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/2 > ... > I0510 20:52:00.209815 25010 registry_puller.cpp:287] Pulling image > 'quay.io/coreos/alpine-sh' from > 'docker-manifest://quay.iocoreos/alpine-sh?latest#https' to > '/tmp/ImageAlpine_ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_2_wF7EfM/store/docker/staging/qit1Jn' > E0510 20:52:00.756072 25003 slave.cpp:6176] Container > '5eb869c5-555c-4dc9-a6ce-ddc2e7dbd01a' for executor > 'ad9aa898-026e-47d8-bac6-0ff993ec5904' of framework > 7dbe7cd6-8ffe-4bcf-986a-17ba677b5a69- failed to start: Failed to decode > HTTP responses: Decoding failed > HTTP/2 200 > server: nginx/1.13.12 > date: Fri, 11 May 2018 03:52:00 GMT > content-type: application/vnd.docker.distribution.manifest.v1+prettyjws > content-length: 4486 > docker-content-digest: > sha256:61bd5317a92c3213cfe70e2b629098c51c50728ef48ff984ce929983889ed663 > x-frame-options: DENY > strict-transport-security: max-age=63072000; preload > ... > {noformat} > Note that curl is saying the HTTP version is "HTTP/2". This happens on modern > curl that automatically negotiates HTTP/2, but the docker fetcher isn't > prepared to parse that. 
> {noformat} > $ curl -i --raw -L -s -S -o - 'http://quay.io/coreos/alpine-sh?latest#https' > HTTP/1.1 301 Moved Permanently > Content-Type: text/html > Date: Fri, 11 May 2018 04:07:44 GMT > Location: https://quay.io/coreos/alpine-sh?latest > Server: nginx/1.13.12 > Content-Length: 186 > Connection: keep-alive > HTTP/2 301 > server: nginx/1.13.12 > date: Fri, 11 May 2018 04:07:45 GMT > content-type: text/html; charset=utf-8 > content-length: 287 > location: https://quay.io/coreos/alpine-sh/?latest > x-frame-options: DENY > strict-transport-security: max-age=63072000; preload > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
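The decoding failure above comes down to the status line: the docker fetcher expected {{HTTP/1.1 200 OK}}, while curl's HTTP/2 output is {{HTTP/2 200}} with no minor version and no reason phrase. A hedged sketch of a parser tolerant of both forms (an illustration only, not the actual Mesos decoder fix):

```cpp
#include <cassert>
#include <string>

// Extract the status code from either "HTTP/1.1 200 OK" or "HTTP/2 200".
// Accepts "HTTP/<major>[.<minor>] <code>[ <reason>]"; returns -1 on a
// malformed line.
int parseStatusCode(const std::string& statusLine) {
  if (statusLine.compare(0, 5, "HTTP/") != 0) return -1;

  // The status code follows the first space, reason phrase is optional.
  std::size_t space = statusLine.find(' ');
  if (space == std::string::npos) return -1;

  std::size_t start = space + 1;
  std::size_t end = statusLine.find(' ', start);
  std::string code = statusLine.substr(
      start, end == std::string::npos ? std::string::npos : end - start);

  if (code.size() != 3) return -1;
  for (char c : code) {
    if (c < '0' || c > '9') return -1;
  }
  return std::stoi(code);
}
```

Parsing the version and code separately, instead of matching a fixed {{HTTP/1.x <code> <reason>}} template, is what makes the HTTP/2 form digestible.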
[jira] [Assigned] (MESOS-8999) Add default bodies for libprocess HTTP error responses.
[ https://issues.apache.org/jira/browse/MESOS-8999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-8999: -- Shepherd: Alexander Rukletsov Assignee: Benno Evers Sprint: Mesosphere RI-6 Sprint 2018-30 Story Points: 3 Labels: mesosphere observability (was: ) Component/s: libprocess > Add default bodies for libprocess HTTP error responses. > --- > > Key: MESOS-8999 > URL: https://issues.apache.org/jira/browse/MESOS-8999 > Project: Mesos > Issue Type: Improvement > Components: libprocess >Reporter: Benno Evers >Assignee: Benno Evers >Priority: Major > Labels: mesosphere, observability > > By default on error libprocess would only return a response > with the correct status code and no response body. > However, most browsers do not visually indicate the response > status code, so if any error occurs anyone using a browser will only > see a blank page, making it hard to figure out what happened. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9298) Task failures sometimes can't be understood without looking into agent logs.
Alexander Rukletsov created MESOS-9298: -- Summary: Task failures sometimes can't be understood without looking into agent logs. Key: MESOS-9298 URL: https://issues.apache.org/jira/browse/MESOS-9298 Project: Mesos Issue Type: Epic Components: scheduler api Reporter: Alexander Rukletsov Mesos communicates task state transitions via task status updates. They often include a reason, which aims to hint at what exactly went wrong. However, these reasons are often:
- misleading,
- vague,
- generic.
Needless to say, this complicates triaging why the task has actually failed and hence is a bad user experience. The failures can come from a bunch of different sources: fetcher, isolators (including custom ones!), namespace setup, etc. This epic aims to improve the UX by providing detailed, ideally typed, information about task failures. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9274) v1 JAVA scheduler library can drop TEARDOWN upon destruction.
[ https://issues.apache.org/jira/browse/MESOS-9274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16639361#comment-16639361 ] Alexander Rukletsov commented on MESOS-9274: *Backports to 1.7.1:* {noformat} 830a7d53218ae472d10cf5733dab2c13600638b2 f8ba9e3f4fb1bb8fe7d0e35bd3d92696cb8381a7 {noformat} *Backports to 1.6.2:* {noformat} 6ec452b7ecaae63a1eb79416b58ac5916c3fff6c e26e907ff72670877af6b7868634df335d04006d {noformat} These patches can't be backported to 1.5.x because [the scheduler library|https://github.com/apache/mesos/blob/ba960ed45e80119eadf398abd72609538fbc983e/include/mesos/v1/scheduler.hpp#L65] does not provide the {{call()}} method, which [was introduced|https://github.com/apache/mesos/commit/c39ef69514e57ca7c90e764a4a617abf88cd144f#diff-008387c75189aa7afcf0726f8d22530b] in Mesos 1.6.0. > v1 JAVA scheduler library can drop TEARDOWN upon destruction. > - > > Key: MESOS-9274 > URL: https://issues.apache.org/jira/browse/MESOS-9274 > Project: Mesos > Issue Type: Bug > Components: java api, scheduler driver >Reporter: Alexander Rukletsov >Assignee: Alexander Rukletsov >Priority: Major > Labels: api, mesosphere, scheduler > Fix For: 1.6.2, 1.7.1, 1.8.0 > > > Currently the v1 JAVA scheduler library neither ensures {{Call}} s are sent > to the master nor waits for responses. This can be problematic if the library > is destroyed (or garbage collected) right after sending a {{TEARDOWN}} call: > destruction of the underlying {{Mesos}} actor races with sending the call. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-9274) v1 JAVA scheduler library can drop TEARDOWN upon destruction.
[ https://issues.apache.org/jira/browse/MESOS-9274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630527#comment-16630527 ] Alexander Rukletsov edited comment on MESOS-9274 at 10/4/18 9:01 AM: - I see several possible solutions here: * Ensure the JAVA scheduler library is not destructed after {{TEARDOWN}} is sent. This is out of our control hence does not seem like a good solution or user experience * Add {{sleep(5)}} in [{{V1Mesos::finalize()}}|https://github.com/apache/mesos/blob/270c4cb62f5680bcf952bfb7ec8dfc10843f21e0/src/java/jni/org_apache_mesos_v1_scheduler_V1Mesos.cpp#L258]. This is a hacky solution but it [_follows the pattern_|https://github.com/apache/mesos/blob/86653356d763fee79e9467cf7b07bebb449e8aff/src/launcher/default_executor.cpp#L1082] ;). * Use {{Mesos::call()}} instead of {{Mesos::send()}} and wait for the response in {{v1Mesos::send()}}. This seems like the cleanest solution. was (Author: alexr): I see several possible solutions here: * Ensure the JAVA scheduler library is not destructed after {{TEARDOWN}} is sent. This is out of our control hence does not seem like a good solution or user experience * Add {{sleep(5)}} in [{{V1Mesos::finalize()}}|https://github.com/apache/mesos/blob/270c4cb62f5680bcf952bfb7ec8dfc10843f21e0/src/java/jni/org_apache_mesos_v1_scheduler_V1Mesos.cpp#L258]. This is a hacky solution but it [_follows the pattern_|https://github.com/apache/mesos/blob/86653356d763fee79e9467cf7b07bebb449e8aff/src/launcher/default_executor.cpp#L1082] ;). * Use {[Mesos::call()}} instead of {{Mesos::send()}} and wait for the response in {{v1Mesos::send()}}. This seems like the cleanest solution. > v1 JAVA scheduler library can drop TEARDOWN upon destruction. 
> - > > Key: MESOS-9274 > URL: https://issues.apache.org/jira/browse/MESOS-9274 > Project: Mesos > Issue Type: Bug > Components: java api, scheduler driver >Reporter: Alexander Rukletsov >Assignee: Alexander Rukletsov >Priority: Major > Labels: api, mesosphere, scheduler > > Currently the v1 JAVA scheduler library neither ensures {{Call}} s are sent > to the master nor waits for responses. This can be problematic if the library > is destroyed (or garbage collected) right after sending a {{TEARDOWN}} call: > destruction of the underlying {{Mesos}} actor races with sending the call. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-9116) Launch nested container session fails due to incorrect detection of `mnt` namespace of command executor's task.
[ https://issues.apache.org/jira/browse/MESOS-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16586276#comment-16586276 ] Alexander Rukletsov edited comment on MESOS-9116 at 9/28/18 2:31 PM: - Backports to 1.6.x: {noformat} cfba574408a85861d424a2c58d3d7277490c398e 6d884fbf9be169fd97483a1f341540c5354d88a9 a4409826deada53eef8843df1a0178e9edfa4c9c 20a4d4fae2f30f9e5436a154087c1a1bb9dc0629 {noformat} Backports to 1.5.x: {noformat} 6dd3fcc8ab2aecd182fff29deac07b32b3cc2d81 edeac7b0da5dd7ee1e4e50320d964eb84220d87d 966574a31a3f8c5d4f9a5f02eeb1644aff7fdc97 e4d8ab9911af6d494aae7f5762dd84b8f085fd1e {noformat} was (Author: alexr): Backports to 1.6.x: {noformat} cfba574408a85861d424a2c58d3d7277490c398e 6d884fbf9be169fd97483a1f341540c5354d88a9 a4409826deada53eef8843df1a0178e9edfa4c9c 20a4d4fae2f30f9e5436a154087c1a1bb9dc0629 {noformat} Backports to 1.5.x: {noformat} 6dd3fcc8ab2aecd182fff29deac07b32b3cc2d81 edeac7b0da5dd7ee1e4e50320d964eb84220d87d 966574a31a3f8c5d4f9a5f02eeb1644aff7fdc97 e4d8ab9911af6d494aae7f5762dd84b8f085fd1e {noformat} Backports to 1.4.x (partial): {noformat} c37eb59e4c4b7b6c16509f317c78207da6eeb485 {noformat} > Launch nested container session fails due to incorrect detection of `mnt` > namespace of command executor's task. 
> --- > > Key: MESOS-9116 > URL: https://issues.apache.org/jira/browse/MESOS-9116 > Project: Mesos > Issue Type: Bug > Components: agent, containerization >Affects Versions: 1.4.2, 1.5.1, 1.6.1, 1.7.0 >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Critical > Labels: mesosphere > Fix For: 1.5.2, 1.6.2, 1.7.0 > > Attachments: pstree.png > > > Launch nested container call might fail with the following error: > {code:java} > Failed to enter mount namespace: Failed to open '/proc/29473/ns/mnt': No such > file or directory > {code} > This happens when the containerizer launcher [tries to > enter|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/launch.cpp#L879-L892] > `mnt` namespace using the pid of a terminated process. The pid [was > detected|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/containerizer.cpp#L1930-L1958] > by the agent before spawning the containerizer launcher process, because the > process was running back then. > The issue can be reproduced using the following test (pseudocode): > {code:java} > launchTask("sleep 1000") > parentContainerId = containerizer.containers().begin() > outputs = [] > for i in range(10): > ContainerId containerId > containerId.parent = parentContainerId > containerId.id = UUID.random() > LAUNCH_NESTED_CONTAINER_SESSION(containerId, "echo echo") > response = ATTACH_CONTAINER_OUTPUT(containerId) > outputs.append(response.reader) > for output in outputs: > stdout, stderr = getProcessIOData(output) > assert("echo" == stdout + stderr){code} > When we start the very first nested container, `getMountNamespaceTarget()` > returns a PID of the task (`sleep 1000`), because it's the only process whose > `mnt` namespace differs from the parent container. This nested container > becomes a child of PID 1 process, which is also a parent of the command > executor. It's not an executor's child! 
It can be seen in attached > `pstree.png`. > When we start a second nested container, `getMountNamespaceTarget()` might > return PID of the previous nested container (`echo echo`) instead of the > task's PID (`sleep 1000`). It happens because the first nested container > entered `mnt` namespace of the task. Then, the containerizer launcher > ("nanny" process) attempts to enter `mnt` namespace using the PID of a > terminated process, so we get this error. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9277) UNRESERVE scheduler call can be dropped if it loses the race with TEARDOWN.
Alexander Rukletsov created MESOS-9277: -- Summary: UNRESERVE scheduler call can be dropped if it loses the race with TEARDOWN. Key: MESOS-9277 URL: https://issues.apache.org/jira/browse/MESOS-9277 Project: Mesos Issue Type: Bug Components: scheduler api Affects Versions: 1.7.0, 1.6.1, 1.5.1 Reporter: Alexander Rukletsov A typical use pattern for a framework scheduler is to remove its reservations before tearing itself down. However, it is racy: {{UNRESERVE}} is a multi-stage action which aborts if the framework is removed in-between. *Solution 1* Let schedulers use operation feedback and expect them to wait for an ack for {{UNRESERVE}} before they send {{TEARDOWN}}. Kind of science fiction with a timeline of {{O(months)}} and still possibilities for the race if a scheduler does not comply. *Solution 2* Serialize calls for schedulers. For example, we can chain [handlers here|https://github.com/apache/mesos/blob/6e21e94ddca5b776d44636fe3eba8500bf88dc25/src/master/http.cpp#L640-L711] onto a per-{{Master::Framework}} [{{process::Sequence}}|https://github.com/apache/mesos/blob/6e21e94ddca5b776d44636fe3eba8500bf88dc25/3rdparty/libprocess/include/process/sequence.hpp]. For that, however, handlers must provide futures indicating when the processing of the call is finished; note that most [handlers here|https://github.com/apache/mesos/blob/6e21e94ddca5b776d44636fe3eba8500bf88dc25/src/master/http.cpp#L640-L711] return void. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
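The per-framework serialization of Solution 2 can be sketched with a toy call sequence: each handler starts only after the previous one signals completion, which is the property {{process::Sequence}} provides via chained futures. The class and callback names below are illustrative, not the master's real API:

```cpp
#include <cassert>
#include <deque>
#include <functional>

// A handler receives a `Done` callback and must invoke it when the call
// is fully processed -- mirroring the requirement that handlers return
// a Future instead of void.
using Done = std::function<void()>;
using Handler = std::function<void(Done)>;

// Toy per-framework sequence: calls run strictly one after another, so
// an UNRESERVE in flight can never be overtaken by a later TEARDOWN.
class FrameworkCallSequence {
public:
  // Enqueue a call; it starts only after all earlier calls completed.
  void dispatch(Handler handler) {
    queue_.push_back(std::move(handler));
    if (!running_) runNext();
  }

private:
  void runNext() {
    if (queue_.empty()) { running_ = false; return; }
    running_ = true;
    Handler h = std::move(queue_.front());
    queue_.pop_front();
    // Completion of this handler triggers the next one in the chain.
    h([this] { runNext(); });
  }

  std::deque<Handler> queue_;
  bool running_ = false;
};
```

The explicit {{Done}} callback is the sketch's equivalent of the completion future each master handler would need to return.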
[jira] [Commented] (MESOS-9274) v1 JAVA scheduler library can drop TEARDOWN upon destruction.
[ https://issues.apache.org/jira/browse/MESOS-9274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630527#comment-16630527 ] Alexander Rukletsov commented on MESOS-9274: I see several possible solutions here: * Ensure the JAVA scheduler library is not destructed after {{TEARDOWN}} is sent. This is out of our control hence does not seem like a good solution or user experience. * Add {{sleep(5)}} in [{{V1Mesos::finalize()}}|https://github.com/apache/mesos/blob/270c4cb62f5680bcf952bfb7ec8dfc10843f21e0/src/java/jni/org_apache_mesos_v1_scheduler_V1Mesos.cpp#L258]. This is a hacky solution but it [_follows the pattern_|https://github.com/apache/mesos/blob/86653356d763fee79e9467cf7b07bebb449e8aff/src/launcher/default_executor.cpp#L1082] ;). * Use {{Mesos::call()}} instead of {{Mesos::send()}} and wait for the response in {{v1Mesos::send()}}. This seems like the cleanest solution. > v1 JAVA scheduler library can drop TEARDOWN upon destruction. > - > > Key: MESOS-9274 > URL: https://issues.apache.org/jira/browse/MESOS-9274 > Project: Mesos > Issue Type: Bug > Components: java api, scheduler driver >Reporter: Alexander Rukletsov >Assignee: Alexander Rukletsov >Priority: Major > Labels: api, mesosphere, scheduler > > Currently the v1 JAVA scheduler library neither ensures {{Call}} s are sent > to the master nor waits for responses. This can be problematic if the library > is destroyed (or garbage collected) right after sending a {{TEARDOWN}} call: > destruction of the underlying {{Mesos}} actor races with sending the call. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9274) v1 JAVA scheduler library can drop TEARDOWN upon destruction.
Alexander Rukletsov created MESOS-9274: -- Summary: v1 JAVA scheduler library can drop TEARDOWN upon destruction. Key: MESOS-9274 URL: https://issues.apache.org/jira/browse/MESOS-9274 Project: Mesos Issue Type: Bug Components: java api, scheduler driver Reporter: Alexander Rukletsov Assignee: Alexander Rukletsov Currently the v1 JAVA scheduler library neither ensures {{Call}} s are sent to the master nor waits for responses. This can be problematic if the library is destroyed (or garbage collected) right after sending a {{TEARDOWN}} call: destruction of the underlying {{Mesos}} actor races with sending the call. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9257) AgentAPITest.LaunchNestedContainerSessionsInParallel is flaky
[ https://issues.apache.org/jira/browse/MESOS-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629460#comment-16629460 ] Alexander Rukletsov commented on MESOS-9257: Disabled this test for now in {{af5af29ce217f63aeec59bed81f2a742d2c5602a}}. > AgentAPITest.LaunchNestedContainerSessionsInParallel is flaky > - > > Key: MESOS-9257 > URL: https://issues.apache.org/jira/browse/MESOS-9257 > Project: Mesos > Issue Type: Bug > Components: agent > Environment: Debian \{8, 9} SSL >Reporter: Andrei Budnik >Priority: Major > Labels: flaky-test > Attachments: LaunchNestedContainerSessionsInParallel-badrun.txt > > > {code:java} > ../../src/tests/api_tests.cpp:6641: Failure > Expected: "echo\n" > To be equal to: stdoutReceived + stderrReceived > Which is: " > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9261) PersistentVolumeTest.ShrinkVolume is flaky
[ https://issues.apache.org/jira/browse/MESOS-9261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628550#comment-16628550 ] Alexander Rukletsov commented on MESOS-9261: I don't see anything in the log that can hint why task {{test `cat path1/file` = abc}} has failed. > PersistentVolumeTest.ShrinkVolume is flaky > -- > > Key: MESOS-9261 > URL: https://issues.apache.org/jira/browse/MESOS-9261 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Priority: Major > Labels: flaky-test > > Observed in an internal CI run: > {noformat} > ../../src/tests/persistent_volume_tests.cpp:832 > Expected: TASK_FINISHED > To be equal to: taskFinished->state() > Which is: TASK_FAILED > {noformat} > Full log: > {noformat} > [ RUN ] DiskResource/PersistentVolumeTest.ShrinkVolume/0 > I0925 23:58:13.544659 21740 cluster.cpp:173] Creating default 'local' > authorizer > I0925 23:58:13.545785 9453 master.cpp:413] Master > 9f8d4b56-de4c-4df6-86d9-92a6c3c9e432 (ip-172-16-10-34.ec2.internal) started > on 172.16.10.34:35358 > I0925 23:58:13.545801 9453 master.cpp:416] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="hierarchical" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/tf2SmN/credentials" --filter_gpu_resources="true" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" --http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" 
--memory_profiling="false" > --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" > --version="false" --webui_dir="/usr/local/share/mesos/webui" > --work_dir="/tmp/tf2SmN/master" --zk_session_timeout="10secs" > I0925 23:58:13.545931 9453 master.cpp:465] Master only allowing > authenticated frameworks to register > I0925 23:58:13.545939 9453 master.cpp:471] Master only allowing > authenticated agents to register > I0925 23:58:13.545945 9453 master.cpp:477] Master only allowing > authenticated HTTP frameworks to register > I0925 23:58:13.545951 9453 credentials.hpp:37] Loading credentials for > authentication from '/tmp/tf2SmN/credentials' > I0925 23:58:13.546041 9453 master.cpp:521] Using default 'crammd5' > authenticator > I0925 23:58:13.546085 9453 http.cpp:1037] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I0925 23:58:13.546119 9453 http.cpp:1037] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I0925 23:58:13.546149 9453 http.cpp:1037] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I0925 23:58:13.546174 9453 master.cpp:602] Authorization enabled > I0925 23:58:13.546268 9457 hierarchical.cpp:182] Initialized hierarchical > allocator process > I0925 23:58:13.546294 9457 whitelist_watcher.cpp:77] No whitelist given > I0925 23:58:13.546878 9458 master.cpp:2083] Elected as the leading master! 
> I0925 23:58:13.546891 9458 master.cpp:1638] Recovering from registrar > I0925 23:58:13.546941 9453 registrar.cpp:339] Recovering registrar > I0925 23:58:13.547065 9453 registrar.cpp:383] Successfully fetched the > registry (0B) in 0ns > I0925 23:58:13.547092 9453 registrar.cpp:487] Applied 1 operations in > 7135ns; attempting to update the registry > I0925 23:58:13.547225 9453 registrar.cpp:544] Successfully updated the > registry in 0ns > I0925 23:58:13.547250 9453 registrar.cpp:416] Successfully recovered > registrar > I0925 23:58:13.547319 9453 master.cpp:1752] Recovered 0 agents from the > registry (172B); allowing 10mins for agents to reregister > I0925 23:58:13.547336 9457 hierarchical.cpp:220] Skipping recovery of > hierarchical allocator: nothing to recover > W0925 23:58:13.549054 21740 process.cpp:2810] Attempted to spawn already > running process files@172.16.10.34:35358 > I0925 23:58:13.549363 21740 containerizer.cpp:305] Using isolation
[jira] [Commented] (MESOS-9262) ProvisionerDockerBackendTest.ROOT_INTERNET_CURL_DTYPE_Whiteout is flaky
[ https://issues.apache.org/jira/browse/MESOS-9262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628547#comment-16628547 ] Alexander Rukletsov commented on MESOS-9262: {noformat} E0925 23:59:40.077899 3539 slave.cpp:6162] Container 'c09a3eb9-7d46-4ff0-8b70-ec87d7adf2e2' for executor 'c32a603e-7202-4534-a218-3116d8d5bb34' of framework 34b4dabd-2b7c-4ba6-bccf-4dfa968087a1- failed to start: Collect failed: Failed to perform 'curl': curl: (52) Empty reply from server {noformat} I wonder whether this is related to the recent flavour of MESOS-7425 and why we use weird images like {{zhq527725/whiteout}}? > ProvisionerDockerBackendTest.ROOT_INTERNET_CURL_DTYPE_Whiteout is flaky > --- > > Key: MESOS-9262 > URL: https://issues.apache.org/jira/browse/MESOS-9262 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Priority: Major > Labels: flaky-test > > Observed in an internal CI run (4489): > {noformat} > ../../src/tests/containerizer/provisioner_docker_tests.cpp:915 > Expected: TASK_STARTING > To be equal to: statusStarting->state() > Which is: TASK_FAILED > {noformat} > Full log: > {noformat} > [ RUN ] > BackendFlag/ProvisionerDockerBackendTest.ROOT_INTERNET_CURL_DTYPE_Whiteout/0 > I0925 23:59:24.750632 21740 cluster.cpp:173] Creating default 'local' > authorizer > I0925 23:59:24.752059 3540 master.cpp:413] Master > 34b4dabd-2b7c-4ba6-bccf-4dfa968087a1 (ip-172-16-10-34.ec2.internal) started > on 172.16.10.34:41596 > I0925 23:59:24.752087 3540 master.cpp:416] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="hierarchical" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/f5XfyH/credentials" 
--filter_gpu_resources="true" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" --http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" > --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" > --version="false" --webui_dir="/usr/local/share/mesos/webui" > --work_dir="/tmp/f5XfyH/master" --zk_session_timeout="10secs" > I0925 23:59:24.752307 3540 master.cpp:465] Master only allowing > authenticated frameworks to register > I0925 23:59:24.752393 3540 master.cpp:471] Master only allowing > authenticated agents to register > I0925 23:59:24.752409 3540 master.cpp:477] Master only allowing > authenticated HTTP frameworks to register > I0925 23:59:24.752418 3540 credentials.hpp:37] Loading credentials for > authentication from '/tmp/f5XfyH/credentials' > I0925 23:59:24.752590 3540 master.cpp:521] Using default 'crammd5' > authenticator > I0925 23:59:24.752715 3540 http.cpp:1037] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I0925 23:59:24.752769 3540 http.cpp:1037] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I0925 23:59:24.752804 3540 http.cpp:1037] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I0925 23:59:24.752835 3540 master.cpp:602] Authorization enabled > I0925 23:59:24.753206 3539 
whitelist_watcher.cpp:77] No whitelist given > I0925 23:59:24.753266 3544 hierarchical.cpp:182] Initialized hierarchical > allocator process > I0925 23:59:24.753803 3540 master.cpp:2083] Elected as the leading master! > I0925 23:59:24.753823 3540 master.cpp:1638] Recovering from registrar > I0925 23:59:24.753863 3540 registrar.cpp:339] Recovering registrar > I0925 23:59:24.754007 3540 registrar.cpp:383] Successfully fetched the > registry (0B) in 130048ns > I0925 23:59:24.754041 3540 registrar.cpp:487] Applied 1 operations in > 8734ns; attempting to update the registry > I0925 23:59:24.754166 3540 registrar.cpp:544] Successfully updated the > registry in 108032ns > I0925 23:59:24.754195 3540 registrar.cpp:416]
[jira] [Commented] (MESOS-9264) NestedContainerCniTest.ROOT_INTERNET_CURL_VerifyContainerHostname is flaky
[ https://issues.apache.org/jira/browse/MESOS-9264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628542#comment-16628542 ] Alexander Rukletsov commented on MESOS-9264: Apparently, {{library/alpine}} could not be fetched in 15s? > NestedContainerCniTest.ROOT_INTERNET_CURL_VerifyContainerHostname is flaky > -- > > Key: MESOS-9264 > URL: https://issues.apache.org/jira/browse/MESOS-9264 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Priority: Major > Labels: flaky-test > > Observed in an internal CI run: (4488) > {noformat} > ../../src/tests/containerizer/cni_isolator_tests.cpp:1969 > Failed to wait 15secs for updateRunning > {noformat} > Full log: > {noformat} > [ RUN ] > JoinParentsNetworkParam/NestedContainerCniTest.ROOT_INTERNET_CURL_VerifyContainerHostname/0 > I0925 22:02:08.400498 11809 cluster.cpp:173] Creating default 'local' > authorizer > I0925 22:02:08.401520 30157 master.cpp:413] Master > d800b4fe-ffe8-4a9c-b6cb-93f9ce4d0c8c (ip-172-16-10-238.ec2.internal) started > on 172.16.10.238:41592 > I0925 22:02:08.401608 30157 master.cpp:416] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="hierarchical" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/p8aET3/credentials" --filter_gpu_resources="true" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" --http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" 
--memory_profiling="false" > --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" > --version="false" --webui_dir="/usr/local/share/mesos/webui" > --work_dir="/tmp/p8aET3/master" --zk_session_timeout="10secs" > I0925 22:02:08.401738 30157 master.cpp:465] Master only allowing > authenticated frameworks to register > I0925 22:02:08.401749 30157 master.cpp:471] Master only allowing > authenticated agents to register > I0925 22:02:08.401756 30157 master.cpp:477] Master only allowing > authenticated HTTP frameworks to register > I0925 22:02:08.401762 30157 credentials.hpp:37] Loading credentials for > authentication from '/tmp/p8aET3/credentials' > I0925 22:02:08.401834 30157 master.cpp:521] Using default 'crammd5' > authenticator > I0925 22:02:08.401882 30157 http.cpp:1037] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I0925 22:02:08.401932 30157 http.cpp:1037] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I0925 22:02:08.401965 30157 http.cpp:1037] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I0925 22:02:08.401998 30157 master.cpp:602] Authorization enabled > I0925 22:02:08.402230 30163 hierarchical.cpp:182] Initialized hierarchical > allocator process > I0925 22:02:08.402434 30163 whitelist_watcher.cpp:77] No whitelist given > I0925 22:02:08.402696 30157 master.cpp:2083] Elected as the leading master! 
> I0925 22:02:08.402716 30157 master.cpp:1638] Recovering from registrar > I0925 22:02:08.402823 30157 registrar.cpp:339] Recovering registrar > I0925 22:02:08.403005 30157 registrar.cpp:383] Successfully fetched the > registry (0B) in 158208ns > I0925 22:02:08.403045 30157 registrar.cpp:487] Applied 1 operations in > 8612ns; attempting to update the registry > I0925 22:02:08.403218 30156 registrar.cpp:544] Successfully updated the > registry in 128768ns > I0925 22:02:08.403431 30156 registrar.cpp:416] Successfully recovered > registrar > I0925 22:02:08.403694 30157 hierarchical.cpp:220] Skipping recovery of > hierarchical allocator: nothing to recover > I0925 22:02:08.403750 30161 master.cpp:1752] Recovered 0 agents from the > registry (176B); allowing 10mins for agents to reregister > W0925 22:02:08.405280 11809 process.cpp:2810] Attempted to spawn already > running process files@172.16.10.238:41592 > I0925 22:02:08.405745
[jira] [Comment Edited] (MESOS-8096) Enqueueing events in MockHTTPScheduler can lead to segfaults.
[ https://issues.apache.org/jira/browse/MESOS-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604170#comment-16604170 ] Alexander Rukletsov edited comment on MESOS-8096 at 9/25/18 12:23 PM: -- Might be related to this issue, from {{clang-analyzer}}, courtesy of [~mcypark]: {noformat} src/scheduler/scheduler.cpp:911:5: warning: Call to virtual function during destruction will not dispatch to derived class [clang-analyzer-optin.cplusplus.VirtualCall] stop(); ^ {noformat} Likely a hypothetical control flow starting from {{src/tests/http_fault_tolerance_tests.cpp:872}} {noformat} /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1272:5: warning: Use of memory after it is freed [clang-analyzer-cplusplus.NewDelete] return function_mocker_->AddNewExpectation( ^ /tmp/SRC/src/tests/http_fault_tolerance_tests.cpp:872:3: note: Calling 'MockSpec::InternalExpectedAt' EXPECT_CALL(*scheduler, connected(_)) ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1845:32: note: expanded from macro 'EXPECT_CALL' #define EXPECT_CALL(obj, call) GMOCK_EXPECT_CALL_IMPL_(obj, call) ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1844:5: note: expanded from macro 'GMOCK_EXPECT_CALL_IMPL_' ((obj).gmock_##call).InternalExpectedAt(__FILE__, __LINE__, #obj, #call) ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1272:12: note: Calling 'FunctionMockerBase::AddNewExpectation' return function_mocker_->AddNewExpectation( ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1609:9: note: Memory is allocated new TypedExpectation(this, file, line, source_text, m); ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1615:9: note: Assuming 'implicit_sequence' is equal to 
NULL if (implicit_sequence != NULL) { ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1615:5: note: Taking false branch if (implicit_sequence != NULL) { ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1619:13: note: Calling '~linked_ptr' return *expectation; ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googletest/include/gtest/internal/gtest-linked_ptr.h:153:19: note: Calling 'linked_ptr::depart' ~linked_ptr() { depart(); } ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googletest/include/gtest/internal/gtest-linked_ptr.h:205:5: note: Taking true branch if (link_.depart()) delete value_; ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googletest/include/gtest/internal/gtest-linked_ptr.h:205:25: note: Memory is released if (link_.depart()) delete value_; ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googletest/include/gtest/internal/gtest-linked_ptr.h:153:19: note: Returning; memory was released ~linked_ptr() { depart(); } ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1619:13: note: Returning from '~linked_ptr' return *expectation; ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1272:12: note: Returning; memory was released return function_mocker_->AddNewExpectation( ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1272:5: note: Use of memory after it is freed return function_mocker_->AddNewExpectation( ^ {noformat} There are what seems to be equivalent output for the following places: {noformat} /tmp/SRC/src/tests/uri_fetcher_tests.cpp:140:3: note: Calling 'MockSpec::InternalExpectedAt' EXPECT_CALL(server, test(_)) ^ {noformat} {noformat} /tmp/SRC/src/tests/default_executor_tests.cpp:2042:3: note: Calling 'MockSpec::InternalExpectedAt' 
EXPECT_CALL(*scheduler, connected(_)) ^ {noformat} {noformat} /tmp/SRC/src/tests/scheduler_tests.cpp:2037:3: note: Calling 'MockSpec::InternalExpectedAt' EXPECT_CALL(*scheduler, connected(_)) ^ {noformat} {noformat} /tmp/SRC/src/tests/fetcher_tests.cpp:535:3: note: Calling 'MockSpec::InternalExpectedAt' EXPECT_CALL(*http.process, test(_)) ^ {noformat} Of all the {{EXPECT_CALL}}s in the codebase, these are the only instances that are pointed out. It is still unclear whether there's an issue here, but it seems worth checking out, especially since these files are known to be flaky. was (Author: alexr): Might be related to this issue, from {{clang-analyzer}},
[jira] [Assigned] (MESOS-1719) Master should persist framework information
[ https://issues.apache.org/jira/browse/MESOS-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-1719: -- Assignee: (was: Yongqiao Wang) > Master should persist framework information > --- > > Key: MESOS-1719 > URL: https://issues.apache.org/jira/browse/MESOS-1719 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Vinod Kone >Priority: Major > Labels: mesosphere, reliability > > https://issues.apache.org/jira/browse/MESOS-1219 disallows completed > frameworks from re-registering with the same framework id, as long as the > master doesn't fail over. > This ticket tracks the work to make this hold across master failover using > the registrar. > There are some open questions that need to be addressed: > --> Should the registry contain framework ids or framework infos? > For disallowing completed frameworks from re-registering, persisting > framework ids is enough. But if, in the future, we want to disallow > frameworks from re-registering when some parts of the framework info > changed, then we need to persist the info too. > --> How to update the framework info. > Currently frameworks are allowed to update the framework info while > re-registering, but it only takes effect on the master when the master fails > over, and on the slave when the slave fails over. How should things > change when we persist framework info? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.
[ https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619417#comment-16619417 ] Alexander Rukletsov edited comment on MESOS-8545 at 9/21/18 1:01 PM: - *{{master}} aka {{1.8-dev}}*: {noformat} commit 5b95bb0f21852058d22703385f2c8e139881bf1a Author: Andrei Budnik AuthorDate: Tue Sep 18 19:10:14 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Sep 18 19:10:14 2018 +0200 Fixed HTTP errors caused by dropped HTTP responses by IOSwitchboard. Previously, IOSwitchboard process could terminate before all HTTP responses had been sent to the agent. In the case of `ATTACH_CONTAINER_INPUT` call, we could drop a final HTTP `200 OK` response, so the agent got broken HTTP connection for the call. This patch introduces an acknowledgment for the received response for the `ATTACH_CONTAINER_INPUT` call. This acknowledgment is a new type of control messages for the `ATTACH_CONTAINER_INPUT` call. When IOSwitchboard receives an acknowledgment, and io redirects are finished, it terminates itself. That guarantees that the agent always receives a response for the `ATTACH_CONTAINER_INPUT` call. Review: https://reviews.apache.org/r/65168/ {noformat} {noformat} commit 5b95bb0f21852058d22703385f2c8e139881bf1a Author: Andrei Budnik AuthorDate: Tue Sep 18 19:10:14 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Sep 18 19:10:14 2018 +0200 Fixed HTTP errors caused by dropped HTTP responses by IOSwitchboard. Previously, IOSwitchboard process could terminate before all HTTP responses had been sent to the agent. In the case of `ATTACH_CONTAINER_INPUT` call, we could drop a final HTTP `200 OK` response, so the agent got broken HTTP connection for the call. This patch introduces an acknowledgment for the received response for the `ATTACH_CONTAINER_INPUT` call. This acknowledgment is a new type of control messages for the `ATTACH_CONTAINER_INPUT` call. 
When IOSwitchboard receives an acknowledgment, and io redirects are finished, it terminates itself. That guarantees that the agent always receives a response for the `ATTACH_CONTAINER_INPUT` call. Review: https://reviews.apache.org/r/65168/ {noformat} {noformat} commit bfa2bd24780b5c49467b3c23260855e3d8b4c948 Author: Andrei Budnik AuthorDate: Fri Sep 21 14:51:24 2018 +0200 Commit: Alexander Rukletsov CommitDate: Fri Sep 21 14:51:24 2018 +0200 Fixed disconnection while sending acknowledgment to IOSwitchboard. Previously, an HTTP connection to the IOSwitchboard could be garbage collected before the agent sent an acknowledgment to the IOSwitchboard via this connection. This patch fixes the issue by keeping a reference count to the connection in a lambda callback until disconnection occurs. Review: https://reviews.apache.org/r/68768/ {noformat} {noformat} commit c3c77cbef818d497d8bd5e67fa72e55a7190e27a Author: Andrei Budnik AuthorDate: Fri Sep 21 14:51:59 2018 +0200 Commit: Alexander Rukletsov CommitDate: Fri Sep 21 14:51:59 2018 +0200 Fixed broken pipe error in IOSwitchboard. Previous attempt to fix `HTTP 500` "broken pipe" in review /r/62187/ was not correct: after IOSwitchboard sends a response to the agent for the `ATTACH_CONTAINER_INPUT` call, the socket is closed immediately, thus causing the error on the agent. This patch adds a delay after IO redirects are finished and before IOSwitchboard forcibly send a response. 
Review: https://reviews.apache.org/r/68784/ {noformat} *{{1.7.1}}*: {noformat} commit 1672941630960cccf66ed81b11811d84e8a4e3f0 commit 600b388e25c49f4fac4d39bc07bcf6ffce42c679 commit 021a8f4de1ad65167946548e3ecfa74d8e41e9c5 commit 38a914398b6f1aaf08db4f62f4e42cdb80127eb5 {noformat} *{{1.6.2}}*: {noformat} commit 2ddd6f07bebbe91e1e0d5165c4a5ae552b836303 commit c1448f36d4c2c2c8345e7e8d1bf1f206dba18dac commit 55b0e94f0c8a1896ca079361d89527123faf22c6 commit c40c92b7710b5b238b13ce6f1bacd3d75e04283b {noformat} *{{1.5.2}}*: {noformat} commit 3bf4fe22e0ed828a36d5b2ea652d07c6eef4b578 commit 33a6bec95b44592d626874ae8deaa3e2a3bbc120 commit 7b8195680104c2c5f61073a956f60ac961c37f45 commit 0216002744517a6053fd782b6b4dc3d6cf77dd5e {noformat} was (Author: alexr): *{{master}} aka {{1.8-dev}}*: {noformat} commit 5b95bb0f21852058d22703385f2c8e139881bf1a Author: Andrei Budnik AuthorDate: Tue Sep 18 19:10:14 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Sep 18 19:10:14 2018 +0200 Fixed HTTP errors caused by dropped HTTP responses by IOSwitchboard. Previously, IOSwitchboard process could terminate before all HTTP responses had been sent to the agent. In the case of `ATTACH_CONTAINER_INPUT` call, we could drop a final HTTP `200 OK` response, so the agent got broken
[jira] [Comment Edited] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.
[ https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619417#comment-16619417 ] Alexander Rukletsov edited comment on MESOS-8545 at 9/21/18 12:56 PM: -- *{{master}} aka {{1.8-dev}}*: {noformat} commit 5b95bb0f21852058d22703385f2c8e139881bf1a Author: Andrei Budnik AuthorDate: Tue Sep 18 19:10:14 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Sep 18 19:10:14 2018 +0200 Fixed HTTP errors caused by dropped HTTP responses by IOSwitchboard. Previously, IOSwitchboard process could terminate before all HTTP responses had been sent to the agent. In the case of `ATTACH_CONTAINER_INPUT` call, we could drop a final HTTP `200 OK` response, so the agent got broken HTTP connection for the call. This patch introduces an acknowledgment for the received response for the `ATTACH_CONTAINER_INPUT` call. This acknowledgment is a new type of control messages for the `ATTACH_CONTAINER_INPUT` call. When IOSwitchboard receives an acknowledgment, and io redirects are finished, it terminates itself. That guarantees that the agent always receives a response for the `ATTACH_CONTAINER_INPUT` call. Review: https://reviews.apache.org/r/65168/ {noformat} {noformat} commit 5b95bb0f21852058d22703385f2c8e139881bf1a Author: Andrei Budnik AuthorDate: Tue Sep 18 19:10:14 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Sep 18 19:10:14 2018 +0200 Fixed HTTP errors caused by dropped HTTP responses by IOSwitchboard. Previously, IOSwitchboard process could terminate before all HTTP responses had been sent to the agent. In the case of `ATTACH_CONTAINER_INPUT` call, we could drop a final HTTP `200 OK` response, so the agent got broken HTTP connection for the call. This patch introduces an acknowledgment for the received response for the `ATTACH_CONTAINER_INPUT` call. This acknowledgment is a new type of control messages for the `ATTACH_CONTAINER_INPUT` call. 
When IOSwitchboard receives an acknowledgment, and io redirects are finished, it terminates itself. That guarantees that the agent always receives a response for the `ATTACH_CONTAINER_INPUT` call. Review: https://reviews.apache.org/r/65168/ {noformat} {noformat} commit bfa2bd24780b5c49467b3c23260855e3d8b4c948 Author: Andrei Budnik AuthorDate: Fri Sep 21 14:51:24 2018 +0200 Commit: Alexander Rukletsov CommitDate: Fri Sep 21 14:51:24 2018 +0200 Fixed disconnection while sending acknowledgment to IOSwitchboard. Previously, an HTTP connection to the IOSwitchboard could be garbage collected before the agent sent an acknowledgment to the IOSwitchboard via this connection. This patch fixes the issue by keeping a reference count to the connection in a lambda callback until disconnection occurs. Review: https://reviews.apache.org/r/68768/ {noformat} {noformat} commit c3c77cbef818d497d8bd5e67fa72e55a7190e27a Author: Andrei Budnik AuthorDate: Fri Sep 21 14:51:59 2018 +0200 Commit: Alexander Rukletsov CommitDate: Fri Sep 21 14:51:59 2018 +0200 Fixed broken pipe error in IOSwitchboard. Previous attempt to fix `HTTP 500` "broken pipe" in review /r/62187/ was not correct: after IOSwitchboard sends a response to the agent for the `ATTACH_CONTAINER_INPUT` call, the socket is closed immediately, thus causing the error on the agent. This patch adds a delay after IO redirects are finished and before IOSwitchboard forcibly send a response. 
Review: https://reviews.apache.org/r/68784/ {noformat} *{{1.7.1}}*: {noformat} commit 1672941630960cccf66ed81b11811d84e8a4e3f0 commit 600b388e25c49f4fac4d39bc07bcf6ffce42c679 {noformat} *{{1.6.2}}*: {noformat} commit 2ddd6f07bebbe91e1e0d5165c4a5ae552b836303 commit c1448f36d4c2c2c8345e7e8d1bf1f206dba18dac {noformat} *{{1.5.2}}*: {noformat} commit 3bf4fe22e0ed828a36d5b2ea652d07c6eef4b578 commit 33a6bec95b44592d626874ae8deaa3e2a3bbc120 {noformat} was (Author: alexr): *{{master}} aka {{1.8-dev}}*: {noformat} commit 5b95bb0f21852058d22703385f2c8e139881bf1a Author: Andrei Budnik AuthorDate: Tue Sep 18 19:10:14 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Sep 18 19:10:14 2018 +0200 Fixed HTTP errors caused by dropped HTTP responses by IOSwitchboard. Previously, IOSwitchboard process could terminate before all HTTP responses had been sent to the agent. In the case of `ATTACH_CONTAINER_INPUT` call, we could drop a final HTTP `200 OK` response, so the agent got broken HTTP connection for the call. This patch introduces an acknowledgment for the received response for the `ATTACH_CONTAINER_INPUT` call. This acknowledgment is a new type of control messages for the `ATTACH_CONTAINER_INPUT` call. When IOSwitchboard receives an
[jira] [Comment Edited] (MESOS-9131) Health checks launching nested containers while a container is being destroyed lead to unkillable tasks.
[ https://issues.apache.org/jira/browse/MESOS-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619415#comment-16619415 ] Alexander Rukletsov edited comment on MESOS-9131 at 9/18/18 6:14 PM: - *{{master}} aka {{1.8-dev}}*: {noformat} commit 2fdc8f3cffc5eac91e5f2b0c6aef2254acfc2bd0 Author: Andrei Budnik AuthorDate: Tue Sep 18 19:09:31 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Sep 18 19:09:31 2018 +0200 Fixed IOSwitchboard waiting EOF from attach container input request. Previously, when a corresponding nested container terminated, while the user was attached to the container's stdin via `ATTACH_CONTAINER_INPUT` IOSwitchboard didn't terminate immediately. IOSwitchboard was waiting for EOF message from the input HTTP connection. Since the IOSwitchboard was stuck, the corresponding nested container was also stuck in `DESTROYING` state. This patch fixes the aforementioned issue by sending 200 `OK` response for `ATTACH_CONTAINER_INPUT` call in the case when io redirect is finished while reading from the HTTP input connection is not. Review: https://reviews.apache.org/r/68232/ {noformat} {noformat} commit e941d206f651bde861675a6517a89e44d1f61a34 Author: Andrei Budnik AuthorDate: Tue Sep 18 19:10:01 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Sep 18 19:10:01 2018 +0200 Added `AgentAPITest.LaunchNestedContainerSessionKillTask` test. This test verifies that IOSwitchboard, which holds an open HTTP input connection, terminates once IO redirects finish for the corresponding nested container. Review: https://reviews.apache.org/r/68230/ {noformat} {noformat} commit 7ad390b3aa261f4a39ff7f2c0842f2aae39005f4 Author: Andrei Budnik AuthorDate: Tue Sep 18 19:10:07 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Sep 18 19:10:07 2018 +0200 Added `AgentAPITest.AttachContainerInputRepeat` test. This test verifies that we can call `ATTACH_CONTAINER_INPUT` more than once. 
We send a short message first then we send a long message in chunks. Review: https://reviews.apache.org/r/68231/ {noformat} *{{1.7.1}}*: {noformat} commit e9605a6243db41c1bbc85ec9ade112f2ef806c15 commit f672afef601c71d69a9eb4db3c191bacfe167d3e commit 4a1b3186a2fa64bf7d94787f3546dd584e2f1186 {noformat} *{{1.6.2}}*: {noformat} commit e3a9eb3b473a10f210913d568c1d9923ed05d933 commit a1798ae1fb2249280f4a4e9fec69eb9e37b95452 commit d82177d00a4a25d70aab172a91c855ad6b07f768 {noformat} *{{1.5.2}}*: {noformat} commit 5a5089938f13a5aafc0a4ee3308f33e76374c408 commit 25de60746de4681ed0d858cba0790372f03ff840 commit fa6eb85fd2a8798842855628495c16664bc68652 {noformat} was (Author: alexr): *{{master}} aka {{1.8-dev}}*: {noformat} commit 2fdc8f3cffc5eac91e5f2b0c6aef2254acfc2bd0 Author: Andrei Budnik AuthorDate: Tue Sep 18 19:09:31 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Sep 18 19:09:31 2018 +0200 Fixed IOSwitchboard waiting EOF from attach container input request. Previously, when a corresponding nested container terminated, while the user was attached to the container's stdin via `ATTACH_CONTAINER_INPUT` IOSwitchboard didn't terminate immediately. IOSwitchboard was waiting for EOF message from the input HTTP connection. Since the IOSwitchboard was stuck, the corresponding nested container was also stuck in `DESTROYING` state. This patch fixes the aforementioned issue by sending 200 `OK` response for `ATTACH_CONTAINER_INPUT` call in the case when io redirect is finished while reading from the HTTP input connection is not. Review: https://reviews.apache.org/r/68232/ {noformat} {noformat} commit e941d206f651bde861675a6517a89e44d1f61a34 Author: Andrei Budnik AuthorDate: Tue Sep 18 19:10:01 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Sep 18 19:10:01 2018 +0200 Added `AgentAPITest.LaunchNestedContainerSessionKillTask` test. 
This test verifies that IOSwitchboard, which holds an open HTTP input connection, terminates once IO redirects finish for the corresponding nested container. Review: https://reviews.apache.org/r/68230/ {noformat} {noformat} commit 7ad390b3aa261f4a39ff7f2c0842f2aae39005f4 Author: Andrei Budnik AuthorDate: Tue Sep 18 19:10:07 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Sep 18 19:10:07 2018 +0200 Added `AgentAPITest.AttachContainerInputRepeat` test. This test verifies that we can call `ATTACH_CONTAINER_INPUT` more than once. We send a short message first then we send a long message in chunks. Review: https://reviews.apache.org/r/68231/ {noformat} *{{1.7.1}}*: {noformat} commit e9605a6243db41c1bbc85ec9ade112f2ef806c15 commit f672afef601c71d69a9eb4db3c191bacfe167d3e commit
[jira] [Comment Edited] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.
[ https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619417#comment-16619417 ] Alexander Rukletsov edited comment on MESOS-8545 at 9/18/18 6:14 PM:
-
*{{master}} aka {{1.8-dev}}*:
{noformat}
commit 5b95bb0f21852058d22703385f2c8e139881bf1a
Author:     Andrei Budnik
AuthorDate: Tue Sep 18 19:10:14 2018 +0200
Commit:     Alexander Rukletsov
CommitDate: Tue Sep 18 19:10:14 2018 +0200

    Fixed HTTP errors caused by dropped HTTP responses by IOSwitchboard.

    Previously, the IOSwitchboard process could terminate before all HTTP
    responses had been sent to the agent. In the case of the
    `ATTACH_CONTAINER_INPUT` call, we could drop a final HTTP `200 OK`
    response, so the agent got a broken HTTP connection for the call.

    This patch introduces an acknowledgment for the received response for
    the `ATTACH_CONTAINER_INPUT` call. This acknowledgment is a new type
    of control message for the `ATTACH_CONTAINER_INPUT` call. When
    IOSwitchboard receives an acknowledgment, and IO redirects are
    finished, it terminates itself. That guarantees that the agent always
    receives a response for the `ATTACH_CONTAINER_INPUT` call.

    Review: https://reviews.apache.org/r/65168/
{noformat}
*{{1.7.1}}*:
{noformat}
commit 1672941630960cccf66ed81b11811d84e8a4e3f0
commit 600b388e25c49f4fac4d39bc07bcf6ffce42c679
{noformat}
*{{1.6.2}}*:
{noformat}
commit 2ddd6f07bebbe91e1e0d5165c4a5ae552b836303
commit c1448f36d4c2c2c8345e7e8d1bf1f206dba18dac
{noformat}
*{{1.5.2}}*:
{noformat}
commit 3bf4fe22e0ed828a36d5b2ea652d07c6eef4b578
commit 33a6bec95b44592d626874ae8deaa3e2a3bbc120
{noformat}

> AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.
>
[jira] [Comment Edited] (MESOS-9131) Health checks launching nested containers while a container is being destroyed lead to unkillable tasks.
[ https://issues.apache.org/jira/browse/MESOS-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619415#comment-16619415 ] Alexander Rukletsov edited comment on MESOS-9131 at 9/18/18 5:57 PM:
-
*{{master}} aka {{1.8-dev}}*:
{noformat}
commit 2fdc8f3cffc5eac91e5f2b0c6aef2254acfc2bd0
Author:     Andrei Budnik
AuthorDate: Tue Sep 18 19:09:31 2018 +0200
Commit:     Alexander Rukletsov
CommitDate: Tue Sep 18 19:09:31 2018 +0200

    Fixed IOSwitchboard waiting EOF from attach container input request.

    Previously, when a corresponding nested container terminated while the
    user was attached to the container's stdin via `ATTACH_CONTAINER_INPUT`,
    IOSwitchboard didn't terminate immediately. IOSwitchboard was waiting
    for an EOF message from the input HTTP connection. Since IOSwitchboard
    was stuck, the corresponding nested container was also stuck in the
    `DESTROYING` state.

    This patch fixes the aforementioned issue by sending a `200 OK`
    response for the `ATTACH_CONTAINER_INPUT` call in the case when the IO
    redirect is finished while reading from the HTTP input connection is
    not.

    Review: https://reviews.apache.org/r/68232/
{noformat}
{noformat}
commit e941d206f651bde861675a6517a89e44d1f61a34
Author:     Andrei Budnik
AuthorDate: Tue Sep 18 19:10:01 2018 +0200
Commit:     Alexander Rukletsov
CommitDate: Tue Sep 18 19:10:01 2018 +0200

    Added `AgentAPITest.LaunchNestedContainerSessionKillTask` test.

    This test verifies that IOSwitchboard, which holds an open HTTP input
    connection, terminates once IO redirects finish for the corresponding
    nested container.

    Review: https://reviews.apache.org/r/68230/
{noformat}
{noformat}
commit 7ad390b3aa261f4a39ff7f2c0842f2aae39005f4
Author:     Andrei Budnik
AuthorDate: Tue Sep 18 19:10:07 2018 +0200
Commit:     Alexander Rukletsov
CommitDate: Tue Sep 18 19:10:07 2018 +0200

    Added `AgentAPITest.AttachContainerInputRepeat` test.

    This test verifies that we can call `ATTACH_CONTAINER_INPUT` more than
    once. We send a short message first, then we send a long message in
    chunks.

    Review: https://reviews.apache.org/r/68231/
{noformat}
*{{1.7.1}}*:
{noformat}
commit e9605a6243db41c1bbc85ec9ade112f2ef806c15
commit f672afef601c71d69a9eb4db3c191bacfe167d3e
commit 4a1b3186a2fa64bf7d94787f3546dd584e2f1186
{noformat}
*{{1.6.2}}*:
{noformat}
commit e3a9eb3b473a10f210913d568c1d9923ed05d933
commit a1798ae1fb2249280f4a4e9fec69eb9e37b95452
commit d82177d00a4a25d70aab172a91c855ad6b07f768
{noformat}

> Health checks launching nested containers while a container is being
> destroyed lead to unkillable tasks.
>
[jira] [Comment Edited] (MESOS-9131) Health checks launching nested containers while a container is being destroyed lead to unkillable tasks
[ https://issues.apache.org/jira/browse/MESOS-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619415#comment-16619415 ] Alexander Rukletsov edited comment on MESOS-9131 at 9/18/18 5:41 PM: - *{{master}} aka {{1.8-dev}}*: {noformat} commit 2fdc8f3cffc5eac91e5f2b0c6aef2254acfc2bd0 Author: Andrei Budnik AuthorDate: Tue Sep 18 19:09:31 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Sep 18 19:09:31 2018 +0200 Fixed IOSwitchboard waiting EOF from attach container input request. Previously, when a corresponding nested container terminated, while the user was attached to the container's stdin via `ATTACH_CONTAINER_INPUT` IOSwitchboard didn't terminate immediately. IOSwitchboard was waiting for EOF message from the input HTTP connection. Since the IOSwitchboard was stuck, the corresponding nested container was also stuck in `DESTROYING` state. This patch fixes the aforementioned issue by sending 200 `OK` response for `ATTACH_CONTAINER_INPUT` call in the case when io redirect is finished while reading from the HTTP input connection is not. Review: https://reviews.apache.org/r/68232/ {noformat} {noformat} commit e941d206f651bde861675a6517a89e44d1f61a34 Author: Andrei Budnik AuthorDate: Tue Sep 18 19:10:01 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Sep 18 19:10:01 2018 +0200 Added `AgentAPITest.LaunchNestedContainerSessionKillTask` test. This test verifies that IOSwitchboard, which holds an open HTTP input connection, terminates once IO redirects finish for the corresponding nested container. Review: https://reviews.apache.org/r/68230/ {noformat} {noformat} commit 7ad390b3aa261f4a39ff7f2c0842f2aae39005f4 Author: Andrei Budnik AuthorDate: Tue Sep 18 19:10:07 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Sep 18 19:10:07 2018 +0200 Added `AgentAPITest.AttachContainerInputRepeat` test. This test verifies that we can call `ATTACH_CONTAINER_INPUT` more than once. 
We send a short message first then we send a long message in chunks. Review: https://reviews.apache.org/r/68231/ {noformat} *{{1.7.x}} aka {{1.7.1}}*: {noformat} commit e9605a6243db41c1bbc85ec9ade112f2ef806c15 Author: Andrei Budnik AuthorDate: Tue Sep 18 19:09:31 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Sep 18 19:27:17 2018 +0200 Fixed IOSwitchboard waiting EOF from attach container input request. Previously, when a corresponding nested container terminated, while the user was attached to the container's stdin via `ATTACH_CONTAINER_INPUT` IOSwitchboard didn't terminate immediately. IOSwitchboard was waiting for EOF message from the input HTTP connection. Since the IOSwitchboard was stuck, the corresponding nested container was also stuck in `DESTROYING` state. This patch fixes the aforementioned issue by sending 200 `OK` response for `ATTACH_CONTAINER_INPUT` call in the case when io redirect is finished while reading from the HTTP input connection is not. Review: https://reviews.apache.org/r/68232/ (cherry picked from commit 2fdc8f3cffc5eac91e5f2b0c6aef2254acfc2bd0) {noformat} {noformat} commit f672afef601c71d69a9eb4db3c191bacfe167d3e Author: Andrei Budnik AuthorDate: Tue Sep 18 19:10:01 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Sep 18 19:27:17 2018 +0200 Added `AgentAPITest.LaunchNestedContainerSessionKillTask` test. This test verifies that IOSwitchboard, which holds an open HTTP input connection, terminates once IO redirects finish for the corresponding nested container. Review: https://reviews.apache.org/r/68230/ (cherry picked from commit e941d206f651bde861675a6517a89e44d1f61a34) {noformat} {noformat} commit 4a1b3186a2fa64bf7d94787f3546dd584e2f1186 Author: Andrei Budnik AuthorDate: Tue Sep 18 19:10:07 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Sep 18 19:27:17 2018 +0200 Added `AgentAPITest.AttachContainerInputRepeat` test. This test verifies that we can call `ATTACH_CONTAINER_INPUT` more than once. 
We send a short message first then we send a long message in chunks. Review: https://reviews.apache.org/r/68231/ (cherry picked from commit 7ad390b3aa261f4a39ff7f2c0842f2aae39005f4) {noformat}
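The fix described in the commits above changes when the IOSwitchboard answers an `ATTACH_CONTAINER_INPUT` request: instead of waiting only for the client's EOF, it also responds when the container's io redirect finishes. The sketch below is illustrative only — the struct and method names are hypothetical, not the actual Mesos IOSwitchboard API.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of the fix: the old behavior responded to
// `ATTACH_CONTAINER_INPUT` only after the client sent EOF; if the
// container's io redirect finished first (e.g. the container died),
// the switchboard hung and the container sat in `DESTROYING`.
// The fixed logic responds on whichever event happens first.
struct AttachInputConnection {
  bool redirectFinished = false;  // container's io redirect completed
  bool clientSentEof = false;     // client closed its input stream
  std::string response;           // response sent to the client, if any

  void maybeRespond() {
    // Respond once, as soon as either condition holds.
    if (response.empty() && (clientSentEof || redirectFinished)) {
      response = "200 OK";
    }
  }

  void onRedirectFinished() { redirectFinished = true; maybeRespond(); }
  void onClientEof() { clientSentEof = true; maybeRespond(); }
};
```

With the old wait-for-EOF-only logic, `onRedirectFinished()` alone would never produce a response; here it does, which is the essence of the patch.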
[jira] [Commented] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.
[ https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619417#comment-16619417 ] Alexander Rukletsov commented on MESOS-8545: *{{master}} aka {{1.8-dev}}*: {noformat} commit 5b95bb0f21852058d22703385f2c8e139881bf1a Author: Andrei Budnik AuthorDate: Tue Sep 18 19:10:14 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Sep 18 19:10:14 2018 +0200 Fixed HTTP errors caused by dropped HTTP responses by IOSwitchboard. Previously, IOSwitchboard process could terminate before all HTTP responses had been sent to the agent. In the case of `ATTACH_CONTAINER_INPUT` call, we could drop a final HTTP `200 OK` response, so the agent got broken HTTP connection for the call. This patch introduces an acknowledgment for the received response for the `ATTACH_CONTAINER_INPUT` call. This acknowledgment is a new type of control messages for the `ATTACH_CONTAINER_INPUT` call. When IOSwitchboard receives an acknowledgment, and io redirects are finished, it terminates itself. That guarantees that the agent always receives a response for the `ATTACH_CONTAINER_INPUT` call. Review: https://reviews.apache.org/r/65168/ {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky. > --- > > Key: MESOS-8545 > URL: https://issues.apache.org/jira/browse/MESOS-8545 > Project: Mesos > Issue Type: Bug > Components: agent >Affects Versions: 1.5.0, 1.6.1, 1.7.0 >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Major > Labels: Mesosphere, flaky-test > Attachments: > AgentAPIStreamingTest.AttachInputToNestedContainerSession-badrun.txt, > AgentAPIStreamingTest.AttachInputToNestedContainerSession-badrun2.txt > > > {code:java} > I0205 17:11:01.091872 4898 http_proxy.cpp:132] Returning '500 Internal Server > Error' for '/slave(974)/api/v1' (Disconnected) > /home/centos/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-centos-7/mesos/src/tests/api_tests.cpp:6596: > Failure > Value of: (response).get().status > Actual: "500 Internal Server Error" > Expected: http::OK().status > Which is: "200 OK" > Body: "Disconnected" > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
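The acknowledgment protocol in the commit above gates IOSwitchboard termination on two conditions. A minimal sketch of that gating logic, with illustrative names (not the real IOSwitchboard code):

```cpp
#include <cassert>

// Sketch of the acknowledgment protocol: the switchboard may only
// terminate once BOTH the io redirects have finished AND the agent has
// acknowledged the final `ATTACH_CONTAINER_INPUT` response, so the
// final `200 OK` can no longer be dropped by an early exit.
struct Switchboard {
  bool redirectsFinished = false;
  bool responseAcknowledged = false;
  bool terminated = false;

  void maybeTerminate() {
    if (redirectsFinished && responseAcknowledged) {
      terminated = true;
    }
  }

  void onRedirectsFinished() { redirectsFinished = true; maybeTerminate(); }
  void onAcknowledgment() { responseAcknowledged = true; maybeTerminate(); }
};
```

Either event alone leaves the process running; only the pair allows termination, which is what guarantees the agent sees a response.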
[jira] [Commented] (MESOS-9131) Health checks launching nested containers while a container is being destroyed lead to unkillable tasks
[ https://issues.apache.org/jira/browse/MESOS-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619415#comment-16619415 ] Alexander Rukletsov commented on MESOS-9131: *{{master}} aka {{1.8-dev}}*: {noformat} commit 2fdc8f3cffc5eac91e5f2b0c6aef2254acfc2bd0 Author: Andrei Budnik AuthorDate: Tue Sep 18 19:09:31 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Sep 18 19:09:31 2018 +0200 Fixed IOSwitchboard waiting EOF from attach container input request. Previously, when a corresponding nested container terminated, while the user was attached to the container's stdin via `ATTACH_CONTAINER_INPUT` IOSwitchboard didn't terminate immediately. IOSwitchboard was waiting for EOF message from the input HTTP connection. Since the IOSwitchboard was stuck, the corresponding nested container was also stuck in `DESTROYING` state. This patch fixes the aforementioned issue by sending 200 `OK` response for `ATTACH_CONTAINER_INPUT` call in the case when io redirect is finished while reading from the HTTP input connection is not. Review: https://reviews.apache.org/r/68232/ {noformat} {noformat} commit e941d206f651bde861675a6517a89e44d1f61a34 Author: Andrei Budnik AuthorDate: Tue Sep 18 19:10:01 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Sep 18 19:10:01 2018 +0200 Added `AgentAPITest.LaunchNestedContainerSessionKillTask` test. This test verifies that IOSwitchboard, which holds an open HTTP input connection, terminates once IO redirects finish for the corresponding nested container. Review: https://reviews.apache.org/r/68230/ {noformat} {noformat} commit 7ad390b3aa261f4a39ff7f2c0842f2aae39005f4 Author: Andrei Budnik AuthorDate: Tue Sep 18 19:10:07 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Sep 18 19:10:07 2018 +0200 Added `AgentAPITest.AttachContainerInputRepeat` test. This test verifies that we can call `ATTACH_CONTAINER_INPUT` more than once. We send a short message first then we send a long message in chunks. 
Review: https://reviews.apache.org/r/68231/ {noformat} > Health checks launching nested containers while a container is being > destroyed lead to unkillable tasks > --- > > Key: MESOS-9131 > URL: https://issues.apache.org/jira/browse/MESOS-9131 > Project: Mesos > Issue Type: Bug > Components: agent, containerization >Affects Versions: 1.5.1 >Reporter: Jan Schlicht >Assignee: Andrei Budnik >Priority: Blocker > Labels: container-stuck > Fix For: 1.5.2, 1.6.2, 1.7.1, 1.8.0 > > > A container might get stuck in {{DESTROYING}} state if there's a command > health check that starts new nested containers while its parent container is > getting destroyed. > Here are some logs with unrelated lines removed. The > `REMOVE_NESTED_CONTAINER`/`LAUNCH_NESTED_CONTAINER_SESSION` calls keep looping > afterwards. > {noformat} > 2018-04-16 12:37:54: I0416 12:37:54.235877 3863 containerizer.cpp:2807] > Container > db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 has > exited > 2018-04-16 12:37:54: I0416 12:37:54.235914 3863 containerizer.cpp:2354] > Destroying container > db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 in > RUNNING state > 2018-04-16 12:37:54: I0416 12:37:54.235932 3863 containerizer.cpp:2968] > Transitioning the state of container > db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 > from RUNNING to DESTROYING > 2018-04-16 12:37:54: I0416 12:37:54.236100 3852 linux_launcher.cpp:514] > Asked to destroy container > db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.e6e01854-40a0-4da3-b458-2b4cf52bbc11 > 2018-04-16 12:37:54: I0416 12:37:54.237671 3852 linux_launcher.cpp:560] > Using freezer to destroy cgroup > mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11 > 2018-04-16 12:37:54: I0416 12:37:54.240327 3852 cgroups.cpp:3060] Freezing > cgroup > 
/sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11 > 2018-04-16 12:37:54: I0416 12:37:54.244179 3852 cgroups.cpp:1415] > Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11 > after 3.814144ms > 2018-04-16 12:37:54: I0416 12:37:54.250550 3853 cgroups.cpp:3078] Thawing > cgroup >
[jira] [Commented] (MESOS-9241) Delimiters in endpoint names are inconsistent across mesos components.
[ https://issues.apache.org/jira/browse/MESOS-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619185#comment-16619185 ] Alexander Rukletsov commented on MESOS-9241: A brief search reveals that [there are arguments for both|https://stackoverflow.com/questions/10302179/hyphen-underscore-or-camelcase-as-word-delimiter-in-uris]; however, for REST and REST-like APIs underscore {{_}} seems to be the de facto standard: https://api.stripe.com/v1/subscription_items https://developer.twitter.com/en/docs/api-reference-index.html https://www.graph.facebook.com///finance_permissions?user=_permission= Hence the suggestion is to standardise on {{_}} in Mesos. > Delimiters in endpoint names are inconsistent across mesos components. > -- > > Key: MESOS-9241 > URL: https://issues.apache.org/jira/browse/MESOS-9241 > Project: Mesos > Issue Type: Improvement > Components: HTTP API >Reporter: Alexander Rukletsov >Priority: Minor > Labels: api, tech-debt > > At the moment endpoints in Mesos components have both {{-}} and {{_}} as > delimiters: > {noformat} > /master/create-volumes > /master/destroy-volumes > /master/state-summary > /slave(1)/api/v1/resource_provider > {noformat} > This is an inconsistency for no good reason. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
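Standardising on {{_}} amounts to a mechanical rewrite of the hyphenated endpoint names. A hypothetical helper (not part of Mesos) that applies the proposed convention:

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Illustrative only: rewrite an endpoint path so that `_` is the sole
// word delimiter, per the suggestion in the comment above.
std::string normalizeEndpoint(std::string path) {
  std::replace(path.begin(), path.end(), '-', '_');
  return path;
}
```

For example, `/master/create-volumes` would become `/master/create_volumes`, while `/slave(1)/api/v1/resource_provider` is already conformant and passes through unchanged.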
[jira] [Created] (MESOS-9241) Delimiters in endpoint names are inconsistent across mesos components.
Alexander Rukletsov created MESOS-9241: -- Summary: Delimiters in endpoint names are inconsistent across mesos components. Key: MESOS-9241 URL: https://issues.apache.org/jira/browse/MESOS-9241 Project: Mesos Issue Type: Improvement Components: HTTP API Reporter: Alexander Rukletsov At the moment endpoints in Mesos components have both {{-}} and {{_}} as delimiters: {noformat} /master/create-volumes /master/destroy-volumes /master/state-summary /slave(1)/api/v1/resource_provider {noformat} This is an inconsistency for no good reason. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-7121) Make IO Switchboard optional for debug containers
[ https://issues.apache.org/jira/browse/MESOS-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-7121: -- Shepherd: Alexander Rukletsov Assignee: Andrei Budnik Sprint: Mesosphere Sprint 2018-29 Story Points: 5 > Make IO Switchboard optional for debug containers > - > > Key: MESOS-7121 > URL: https://issues.apache.org/jira/browse/MESOS-7121 > Project: Mesos > Issue Type: Improvement >Reporter: Gastón Kleiman >Assignee: Andrei Budnik >Priority: Major > Labels: debugging, health-check, mesosphere, performance > > Starting a new IO switchboard for each debug container adds some overhead. > The functionality provided by the IO switchboard is not always necessary, so > we should make the IO switchboard optional in order to improve the > performance of launching nested containers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8975) Problem and solution overview for the slow API issue.
[ https://issues.apache.org/jira/browse/MESOS-8975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-8975: -- Shepherd: Alexander Rukletsov Assignee: Benno Evers (was: Alexander Rukletsov) > Problem and solution overview for the slow API issue. > - > > Key: MESOS-8975 > URL: https://issues.apache.org/jira/browse/MESOS-8975 > Project: Mesos > Issue Type: Task > Components: HTTP API >Reporter: Alexander Rukletsov >Assignee: Benno Evers >Priority: Major > Labels: performance > > Collect data from the clusters regarding {{state.json}} responsiveness, > figure out where the bottlenecks are, and prepare an overview of solutions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9224) De-duplicate read-only requests to master based on principal.
Alexander Rukletsov created MESOS-9224: -- Summary: De-duplicate read-only requests to master based on principal. Key: MESOS-9224 URL: https://issues.apache.org/jira/browse/MESOS-9224 Project: Mesos Issue Type: Improvement Components: HTTP API Reporter: Alexander Rukletsov Assignee: Benno Evers "Identical" read-only requests can be batched and answered together. With batching available (MESOS-9158), we can now deduplicate requests based on principal. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
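The deduplication idea in MESOS-9224 keys in-flight read-only responses by endpoint and principal, so identical requests share one rendering. A sketch of that idea only — the class and names below are hypothetical, not libprocess code:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Sketch: while a read-only response is being produced, identical
// requests -- same endpoint, same principal -- share the same
// in-flight result instead of triggering another render pass.
class ReadOnlyDeduplicator {
  std::map<std::pair<std::string, std::string>,
           std::shared_ptr<std::string>> inflight;

 public:
  int computations = 0;  // counts actual response renderings

  std::shared_ptr<std::string> request(const std::string& endpoint,
                                       const std::string& principal) {
    const auto key = std::make_pair(endpoint, principal);
    auto it = inflight.find(key);
    if (it != inflight.end()) {
      return it->second;  // deduplicated: reuse the in-flight response
    }
    ++computations;  // stand-in for rendering the (principal-filtered) state
    auto response =
        std::make_shared<std::string>("state as seen by " + principal);
    inflight[key] = response;
    return response;
  }
};
```

Keying on the principal matters because authorization filters the response: two principals may legitimately see different state, so only requests from the same principal can share a result.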
[jira] [Commented] (MESOS-9189) Include 'Connection: close' header in master streaming API responses.
[ https://issues.apache.org/jira/browse/MESOS-9189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608986#comment-16608986 ] Alexander Rukletsov commented on MESOS-9189: This is still in {{master}}. Is this on purpose, [~gkleiman], [~bmahler]? > Include 'Connection: close' header in master streaming API responses. > - > > Key: MESOS-9189 > URL: https://issues.apache.org/jira/browse/MESOS-9189 > Project: Mesos > Issue Type: Improvement > Components: HTTP API >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Attachments: bad_run.txt, good_run.txt > > > We've seen some HTTP intermediaries (e.g. ELB) decide to re-use connections > to mesos as an optimization to avoid re-connection overhead. As a result, > when the end-client of the streaming API disconnects from the intermediary, > the intermediary leaves the connection to mesos open in an attempt to re-use > the connection for another request once the response completes. Mesos then > thinks that the subscriber never disconnected and the intermediary happily > continues to read the streaming events even though there's no end-client. > To help indicate to intermediaries that the connection SHOULD NOT be re-used, > we can set the 'Connection: close' header for streaming API responses. It may > not be respected (since the language seems to be SHOULD NOT), but some > intermediaries may respect it and close the connection if the end-client > disconnects. > Note that libprocess' http server currently doesn't close the connection > based on a handler setting this header, but it doesn't matter here since the > streaming API responses are infinite. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
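The proposal above is a one-header change on the streaming response. A minimal sketch with assumed shapes (a plain header map, not the libprocess `http::Response` API):

```cpp
#include <cassert>
#include <map>
#include <string>

// Sketch: headers for a master streaming API response. `Connection:
// close` hints to intermediaries (e.g. ELB) that this connection
// SHOULD NOT be reused for another request.
std::map<std::string, std::string> streamingResponseHeaders() {
  std::map<std::string, std::string> headers;
  headers["Content-Type"] = "application/json";
  headers["Transfer-Encoding"] = "chunked";  // the event stream is unbounded
  headers["Connection"] = "close";           // do not reuse this connection
  return headers;
}
```

As the description notes, the hint is advisory (RFC language is SHOULD NOT), but a cooperating intermediary will tear down the upstream connection when its end-client disconnects.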
[jira] [Assigned] (MESOS-9194) Extend request batching to '/roles' endpoint
[ https://issues.apache.org/jira/browse/MESOS-9194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-9194: -- Assignee: Benno Evers Sprint: Mesosphere Sprint 2018-28 Story Points: 3 Labels: mesosphere (was: ) Fix Version/s: 1.8.0 > Extend request batching to '/roles' endpoint > > > Key: MESOS-9194 > URL: https://issues.apache.org/jira/browse/MESOS-9194 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Assignee: Benno Evers >Priority: Major > Labels: mesosphere > Fix For: 1.8.0 > > > For consistency and improved performance under load, the `/roles` endpoint > should use the same request batching mechanism as `/state`, `/tasks`, ... -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-9116) Launch nested container session fails due to incorrect detection of `mnt` namespace of command executor's task.
[ https://issues.apache.org/jira/browse/MESOS-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16586014#comment-16586014 ] Alexander Rukletsov edited comment on MESOS-9116 at 9/6/18 11:44 AM: - {noformat} commit d95a16e03d27a2b6575148183e53a3b4507a16c1 Author: Andrei Budnik AuthorDate: Mon Aug 20 16:22:33 2018 +0200 Commit: Alexander Rukletsov CommitDate: Mon Aug 20 16:22:33 2018 +0200 Added `LaunchNestedContainerSessionInParallel` test. This patch adds a test which verifies that launching multiple short-lived nested container sessions succeeds. This test implicitly verifies that agent correctly detects `mnt` namespace of a command executor's task. If the detection fails, the containerizer launcher (aka `nanny`) process fails to enter `mnt` namespace, so it prints an error message into stderr for this nested container. This test is disabled until we fix MESOS-8545. Review: https://reviews.apache.org/r/68256/ {noformat} {noformat} commit e78f636d84f2709da17275f7d70265520c0f4f94 Author: Andrei Budnik AuthorDate: Mon Aug 20 16:28:31 2018 +0200 Commit: Alexander Rukletsov CommitDate: Mon Aug 20 16:28:31 2018 +0200 Fixed incorrect `mnt` namespace detection of command executor's task. Previously, we were walking the process tree from the container's `init` process to find the first process along the way whose `mnt` namespace differs from the `init` process. We expected this algorithm to always return the PID of the command executor's task. However, if someone launches multiple nested containers within the process tree, the aforementioned algorithm might detect the PID of one of those nested container instead of the command executor's task. Even though the `mnt` namespace will be the same across all these candidates, the detected PID might belong to a short-lived container, which might terminate before the containerizer launcher (aka `nanny` process) tries to enter its `mnt` namespace. 
This patch fixes the detection algorithm so that it always returns the PID of the command executor's task. Review: https://reviews.apache.org/r/68257/ {noformat} {noformat} commit 31499a5dc1de29fa2178e6ea9e5398d8c668a933 Author: Andrei Budnik AuthorDate: Mon Aug 20 16:28:38 2018 +0200 Commit: Alexander Rukletsov CommitDate: Mon Aug 20 16:28:38 2018 +0200 Added `ROOT_CGROUPS_LaunchNestedDebugAfterUnshareMntNamespace` test. This test verifies detection of task's `mnt` namespace for a debug nested container. Debug nested container must enter `mnt` namespace of the task, so the agent tries to detect task's `mnt` namespace. This test launches a long-running task which runs a subtask that unshares `mnt` namespace. The structure of the resulting process tree is similar to the process tree of the command executor (the task of the command executor unshares `mnt` ns): 0. root (aka "nanny"/"launcher" process) [root `mnt` namespace] 1. task: sleep 1000 [root `mnt` namespace] 2. subtask: sleep 1000 [subtask's `mnt` namespace] We expect that the agent detects task's `mnt` namespace. Review: https://reviews.apache.org/r/68408/ {noformat} {noformat} commit b3c9c6939964831170e819f88134af7b275ffe1b Author: Andrei Budnik AuthorDate: Mon Aug 20 16:28:44 2018 +0200 Commit: Alexander Rukletsov CommitDate: Mon Aug 20 16:28:44 2018 +0200 Fixed wrong `mnt` namespace detection for non-command executor tasks. Previously, we were calling `getMountNamespaceTarget()` not only in case of the command executor but in all other cases too, including the default executor. That might lead to various subtle bugs, caused by wrong detection of `mnt` namespace target. This patch fixes the issue by setting a parent PID as `mnt` namespace target in case of non-command executor task. 
Review: https://reviews.apache.org/r/68348/ {noformat} {noformat} commit 52be35f47caea2712a0b13d7f963f7236533a2f1 Author: Andrei Budnik AuthorDate: Thu Sep 6 13:41:06 2018 +0200 Commit: Alexander Rukletsov CommitDate: Thu Sep 6 13:41:06 2018 +0200 Fixed `LaunchNestedContainerSessionsInParallel` test. Previously, we sent `ATTACH_CONTAINER_OUTPUT` to attach to a short-living nested container. An attempt to attach to a terminated nested container leads to HTTP 500 error. This patch gets rid of `ATTACH_CONTAINER_OUTPUT` in favor of `LAUNCH_NESTED_CONTAINER_SESSION` so that we can read the container's output without using an extra call. Review: https://reviews.apache.org/r/68236/ {noformat}
[jira] [Commented] (MESOS-8096) Enqueueing events in MockHTTPScheduler can lead to segfaults.
[ https://issues.apache.org/jira/browse/MESOS-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604170#comment-16604170 ] Alexander Rukletsov commented on MESOS-8096: Might be related to this issue, from {{clang-analyzer}}, courtesy [~mcypark]: {noformat} src/scheduler/scheduler.cpp:911:5: warning: Call to virtual function during destruction will not dispatch to derived class [clang-analyzer-optin.cplusplus.VirtualCall] stop(); ^ {noformat} Likely a hypothetical control flow starting from {{src/tests/http_fault_tolerance_tests.cpp:872}} {noformat} /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1272:5: warning: Use of memory after it is freed [clang-analyzer-cplusplus.NewDelete] return function_mocker_->AddNewExpectation( ^ /tmp/SRC/src/tests/http_fault_tolerance_tests.cpp:872:3: note: Calling 'MockSpec::InternalExpectedAt' EXPECT_CALL(*scheduler, connected(_)) ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1845:32: note: expanded from macro 'EXPECT_CALL' #define EXPECT_CALL(obj, call) GMOCK_EXPECT_CALL_IMPL_(obj, call) ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1844:5: note: expanded from macro 'GMOCK_EXPECT_CALL_IMPL_' ((obj).gmock_##call).InternalExpectedAt(__FILE__, __LINE__, #obj, #call) ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1272:12: note: Calling 'FunctionMockerBase::AddNewExpectation' return function_mocker_->AddNewExpectation( ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1609:9: note: Memory is allocated new TypedExpectation(this, file, line, source_text, m); ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1615:9: note: Assuming 'implicit_sequence' is equal to NULL if (implicit_sequence != 
NULL) { ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1615:5: note: Taking false branch if (implicit_sequence != NULL) { ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1619:13: note: Calling '~linked_ptr' return *expectation; ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googletest/include/gtest/internal/gtest-linked_ptr.h:153:19: note: Calling 'linked_ptr::depart' ~linked_ptr() { depart(); } ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googletest/include/gtest/internal/gtest-linked_ptr.h:205:5: note: Taking true branch if (link_.depart()) delete value_; ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googletest/include/gtest/internal/gtest-linked_ptr.h:205:25: note: Memory is released if (link_.depart()) delete value_; ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googletest/include/gtest/internal/gtest-linked_ptr.h:153:19: note: Returning; memory was released ~linked_ptr() { depart(); } ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1619:13: note: Returning from '~linked_ptr' return *expectation; ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1272:12: note: Returning; memory was released return function_mocker_->AddNewExpectation( ^ /BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1272:5: note: Use of memory after it is freed return function_mocker_->AddNewExpectation( ^ {noformat} There is what appears to be equivalent output for the following places: {noformat} /tmp/SRC/src/tests/uri_fetcher_tests.cpp:140:3: note: Calling 'MockSpec::InternalExpectedAt' EXPECT_CALL(server, test(_)) ^ {noformat} {noformat} /tmp/SRC/src/tests/default_executor_tests.cpp:2042:3: note: Calling 'MockSpec::InternalExpectedAt' EXPECT_CALL(*scheduler, connected(_)) ^ {noformat} 
{noformat} /tmp/SRC/src/tests/scheduler_tests.cpp:2037:3: note: Calling 'MockSpec::InternalExpectedAt' EXPECT_CALL(*scheduler, connected(_)) ^ {noformat} {noformat} /tmp/SRC/src/tests/fetcher_tests.cpp:535:3: note: Calling 'MockSpec::InternalExpectedAt' EXPECT_CALL(*http.process, test(_)) ^ {noformat} Of all the {{EXPECT_CALL}}s in the codebase, these are the only instances that are pointed out. It is still unclear whether there's an issue here, but it seems worth checking out, especially since these files are known to be flaky. > Enqueueing events in MockHTTPScheduler can lead to segfaults. > - > >
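The `VirtualCall` warning quoted above ("Call to virtual function during destruction will not dispatch to derived class") refers to a general C++ rule that is easy to demonstrate in isolation. The toy classes below are unrelated to the real scheduler code; they only show why the analyzer flags `stop()` inside a destructor:

```cpp
#include <cassert>
#include <string>

// During ~Base(), the dynamic type of the object is Base: the Derived
// part has already been destroyed, so the call to stop() binds to
// Base::stop(), not the override.
static std::string lastStop;

struct Base {
  virtual ~Base() { stop(); }  // analyzer flags calls like this one
  virtual void stop() { lastStop = "Base::stop"; }
};

struct Derived : Base {
  ~Derived() override = default;
  void stop() override { lastStop = "Derived::stop"; }
};
```

If `Derived::stop()` did the actual cleanup (as a subclass might expect), that cleanup would silently never run — which is why such a pattern is worth checking even when no crash has been observed.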
[jira] [Comment Edited] (MESOS-9116) Launch nested container session fails due to incorrect detection of `mnt` namespace of command executor's task.
[ https://issues.apache.org/jira/browse/MESOS-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16586276#comment-16586276 ] Alexander Rukletsov edited comment on MESOS-9116 at 9/3/18 10:09 AM: - Backports to 1.6.x: {noformat} cfba574408a85861d424a2c58d3d7277490c398e 6d884fbf9be169fd97483a1f341540c5354d88a9 a4409826deada53eef8843df1a0178e9edfa4c9c 20a4d4fae2f30f9e5436a154087c1a1bb9dc0629 {noformat} Backports to 1.5.x: {noformat} 6dd3fcc8ab2aecd182fff29deac07b32b3cc2d81 edeac7b0da5dd7ee1e4e50320d964eb84220d87d 966574a31a3f8c5d4f9a5f02eeb1644aff7fdc97 e4d8ab9911af6d494aae7f5762dd84b8f085fd1e {noformat} Backports to 1.4.x (partial): {noformat} c37eb59e4c4b7b6c16509f317c78207da6eeb485 {noformat} was (Author: alexr): Backports to 1.6.x: {noformat} cfba574408a85861d424a2c58d3d7277490c398e 6d884fbf9be169fd97483a1f341540c5354d88a9 a4409826deada53eef8843df1a0178e9edfa4c9c 20a4d4fae2f30f9e5436a154087c1a1bb9dc0629 {noformat} Backports to 1.5.x: {noformat} 6dd3fcc8ab2aecd182fff29deac07b32b3cc2d81 edeac7b0da5dd7ee1e4e50320d964eb84220d87d 966574a31a3f8c5d4f9a5f02eeb1644aff7fdc97 e4d8ab9911af6d494aae7f5762dd84b8f085fd1e {noformat} Backports to 1.4.x: {noformat} c37eb59e4c4b7b6c16509f317c78207da6eeb485 {noformat} > Launch nested container session fails due to incorrect detection of `mnt` > namespace of command executor's task. 
> --- > > Key: MESOS-9116 > URL: https://issues.apache.org/jira/browse/MESOS-9116 > Project: Mesos > Issue Type: Bug > Components: agent, containerization >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Critical > Labels: mesosphere > Fix For: 1.4.3, 1.5.2, 1.6.2, 1.7.0 > > Attachments: pstree.png > > > Launch nested container call might fail with the following error: > {code:java} > Failed to enter mount namespace: Failed to open '/proc/29473/ns/mnt': No such > file or directory > {code} > This happens when the containerizer launcher [tries to > enter|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/launch.cpp#L879-L892] > `mnt` namespace using the pid of a terminated process. The pid [was > detected|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/containerizer.cpp#L1930-L1958] > by the agent before spawning the containerizer launcher process, because the > process was running back then. > The issue can be reproduced using the following test (pseudocode): > {code:java} > launchTask("sleep 1000") > parentContainerId = containerizer.containers().begin() > outputs = [] > for i in range(10): > ContainerId containerId > containerId.parent = parentContainerId > containerId.id = UUID.random() > LAUNCH_NESTED_CONTAINER_SESSION(containerId, "echo echo") > response = ATTACH_CONTAINER_OUTPUT(containerId) > outputs.append(response.reader) > for output in outputs: > stdout, stderr = getProcessIOData(output) > assert("echo" == stdout + stderr){code} > When we start the very first nested container, `getMountNamespaceTarget()` > returns a PID of the task (`sleep 1000`), because it's the only process whose > `mnt` namespace differs from the parent container. This nested container > becomes a child of PID 1 process, which is also a parent of the command > executor. It's not an executor's child! It can be seen in attached > `pstree.png`. 
> When we start a second nested container, `getMountNamespaceTarget()` might > return PID of the previous nested container (`echo echo`) instead of the > task's PID (`sleep 1000`). It happens because the first nested container > entered `mnt` namespace of the task. Then, the containerizer launcher > ("nanny" process) attempts to enter `mnt` namespace using the PID of a > terminated process, so we get this error. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
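The detection bug described above is a property of the tree walk, independent of `/proc`. The toy model below (hypothetical structs, not the real containerizer code) contrasts the buggy "first process with a differing `mnt` namespace" walk against the fixed intent of returning the command executor's task:

```cpp
#include <cassert>
#include <vector>

// Toy model: each "process" has a pid, an mnt-namespace id, and a flag
// marking the command executor's task.
struct Proc {
  int pid;
  int mntNamespace;
  bool isExecutorTask;
};

// Buggy walk: first process whose mnt namespace differs from the
// container's init process. A short-lived nested container that
// already entered the task's namespace can be picked up instead.
int detectBuggy(const std::vector<Proc>& procs, int initNamespace) {
  for (const Proc& p : procs) {
    if (p.mntNamespace != initNamespace) {
      return p.pid;
    }
  }
  return -1;
}

// Fixed intent: return specifically the command executor's task, so a
// terminated nested container's PID can never be chosen.
int detectFixed(const std::vector<Proc>& procs) {
  for (const Proc& p : procs) {
    if (p.isExecutorTask) {
      return p.pid;
    }
  }
  return -1;
}
```

With a tree of init (namespace 1), a short-lived nested container (namespace 2), and the task `sleep 1000` (also namespace 2), the buggy walk returns the nested container's PID — which may already be dead by the time the "nanny" process tries to open `/proc/<pid>/ns/mnt`.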
[jira] [Comment Edited] (MESOS-9116) Launch nested container session fails due to incorrect detection of `mnt` namespace of command executor's task.
[ https://issues.apache.org/jira/browse/MESOS-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16586276#comment-16586276 ] Alexander Rukletsov edited comment on MESOS-9116 at 8/31/18 3:15 PM: - Backports to 1.6.x: {noformat} cfba574408a85861d424a2c58d3d7277490c398e 6d884fbf9be169fd97483a1f341540c5354d88a9 a4409826deada53eef8843df1a0178e9edfa4c9c 20a4d4fae2f30f9e5436a154087c1a1bb9dc0629 {noformat} Backports to 1.5.x: {noformat} 6dd3fcc8ab2aecd182fff29deac07b32b3cc2d81 edeac7b0da5dd7ee1e4e50320d964eb84220d87d 966574a31a3f8c5d4f9a5f02eeb1644aff7fdc97 e4d8ab9911af6d494aae7f5762dd84b8f085fd1e {noformat} Backports to 1.4.x: {noformat} c37eb59e4c4b7b6c16509f317c78207da6eeb485 {noformat} was (Author: alexr): Backports to 1.6.x: {noformat} cfba574408a85861d424a2c58d3d7277490c398e 6d884fbf9be169fd97483a1f341540c5354d88a9 a4409826deada53eef8843df1a0178e9edfa4c9c 20a4d4fae2f30f9e5436a154087c1a1bb9dc0629 {noformat} Backports to 1.5.x: {noformat} 6dd3fcc8ab2aecd182fff29deac07b32b3cc2d81 edeac7b0da5dd7ee1e4e50320d964eb84220d87d 966574a31a3f8c5d4f9a5f02eeb1644aff7fdc97 e4d8ab9911af6d494aae7f5762dd84b8f085fd1e {noformat} Backports to 1.4.x: {noformat} c37eb59e4c4b7b6c16509f317c78207da6eeb485 05ec5d1770aeda25b4995487e40f690fe8fa6b19 {noformat} > Launch nested container session fails due to incorrect detection of `mnt` > namespace of command executor's task. 
> --- > > Key: MESOS-9116 > URL: https://issues.apache.org/jira/browse/MESOS-9116 > Project: Mesos > Issue Type: Bug > Components: agent, containerization >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Critical > Labels: mesosphere > Fix For: 1.4.3, 1.5.2, 1.6.2, 1.7.0 > > Attachments: pstree.png > > > Launch nested container call might fail with the following error: > {code:java} > Failed to enter mount namespace: Failed to open '/proc/29473/ns/mnt': No such > file or directory > {code} > This happens when the containerizer launcher [tries to > enter|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/launch.cpp#L879-L892] > `mnt` namespace using the pid of a terminated process. The pid [was > detected|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/containerizer.cpp#L1930-L1958] > by the agent before spawning the containerizer launcher process, because the > process was running back then. > The issue can be reproduced using the following test (pseudocode): > {code:java} > launchTask("sleep 1000") > parentContainerId = containerizer.containers().begin() > outputs = [] > for i in range(10): > ContainerId containerId > containerId.parent = parentContainerId > containerId.id = UUID.random() > LAUNCH_NESTED_CONTAINER_SESSION(containerId, "echo echo") > response = ATTACH_CONTAINER_OUTPUT(containerId) > outputs.append(response.reader) > for output in outputs: > stdout, stderr = getProcessIOData(output) > assert("echo" == stdout + stderr){code} > When we start the very first nested container, `getMountNamespaceTarget()` > returns a PID of the task (`sleep 1000`), because it's the only process whose > `mnt` namespace differs from the parent container. This nested container > becomes a child of PID 1 process, which is also a parent of the command > executor. It's not an executor's child! It can be seen in attached > `pstree.png`. 
> When we start a second nested container, `getMountNamespaceTarget()` might > return the PID of the previous nested container (`echo echo`) instead of the > task's PID (`sleep 1000`). This happens because the first nested container > entered the `mnt` namespace of the task. Then, the containerizer launcher > ("nanny" process) attempts to enter the `mnt` namespace using the PID of a > terminated process, so we get this error. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
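[Editor's note] The detection step described above can be sketched in Python. This is an invented illustration, not Mesos code: `get_mount_namespace_target` mirrors the role of the C++ `getMountNamespaceTarget()` helper, but the signature and the namespace-id strings (as returned by `readlink('/proc/<pid>/ns/mnt')` on Linux) are assumptions.

```python
def get_mount_namespace_target(parent_ns, candidates):
    """Pick a process whose mnt namespace differs from the parent's.

    candidates: dict mapping pid -> mnt namespace id string, e.g. the
    result of readlink('/proc/<pid>/ns/mnt'). Returns the first such
    pid (in pid order), or None if every candidate shares parent_ns.
    """
    for pid in sorted(candidates):
        if candidates[pid] != parent_ns:
            return pid
    return None

# The race described above: a previous nested container entered the
# task's mnt namespace and then exited; if its pid is still among the
# candidates when the snapshot is taken, it can be selected even though
# /proc/<pid>/ns/mnt no longer exists by the time the launcher enters it.
```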
[jira] [Comment Edited] (MESOS-9116) Launch nested container session fails due to incorrect detection of `mnt` namespace of command executor's task.
[ https://issues.apache.org/jira/browse/MESOS-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16586276#comment-16586276 ] Alexander Rukletsov edited comment on MESOS-9116 at 8/31/18 11:56 AM: -- Backports to 1.6.x: {noformat} cfba574408a85861d424a2c58d3d7277490c398e 6d884fbf9be169fd97483a1f341540c5354d88a9 a4409826deada53eef8843df1a0178e9edfa4c9c 20a4d4fae2f30f9e5436a154087c1a1bb9dc0629 {noformat} Backports to 1.5.x: {noformat} 6dd3fcc8ab2aecd182fff29deac07b32b3cc2d81 edeac7b0da5dd7ee1e4e50320d964eb84220d87d 966574a31a3f8c5d4f9a5f02eeb1644aff7fdc97 e4d8ab9911af6d494aae7f5762dd84b8f085fd1e {noformat} Backports to 1.4.x: {noformat} c37eb59e4c4b7b6c16509f317c78207da6eeb485 05ec5d1770aeda25b4995487e40f690fe8fa6b19 {noformat} was (Author: alexr): Backports to 1.6.x: {noformat} cfba574408a85861d424a2c58d3d7277490c398e 6d884fbf9be169fd97483a1f341540c5354d88a9 a4409826deada53eef8843df1a0178e9edfa4c9c 20a4d4fae2f30f9e5436a154087c1a1bb9dc0629 {noformat} Backports to 1.5.x: {noformat} 6dd3fcc8ab2aecd182fff29deac07b32b3cc2d81 edeac7b0da5dd7ee1e4e50320d964eb84220d87d 966574a31a3f8c5d4f9a5f02eeb1644aff7fdc97 e4d8ab9911af6d494aae7f5762dd84b8f085fd1e {noformat} > Launch nested container session fails due to incorrect detection of `mnt` > namespace of command executor's task. 
> --- > > Key: MESOS-9116 > URL: https://issues.apache.org/jira/browse/MESOS-9116 > Project: Mesos > Issue Type: Bug > Components: agent, containerization >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Critical > Labels: mesosphere > Fix For: 1.5.2, 1.6.2, 1.7.0 > > Attachments: pstree.png > > > Launch nested container call might fail with the following error: > {code:java} > Failed to enter mount namespace: Failed to open '/proc/29473/ns/mnt': No such > file or directory > {code} > This happens when the containerizer launcher [tries to > enter|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/launch.cpp#L879-L892] > `mnt` namespace using the pid of a terminated process. The pid [was > detected|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/containerizer.cpp#L1930-L1958] > by the agent before spawning the containerizer launcher process, because the > process was running back then. > The issue can be reproduced using the following test (pseudocode): > {code:java} > launchTask("sleep 1000") > parentContainerId = containerizer.containers().begin() > outputs = [] > for i in range(10): > ContainerId containerId > containerId.parent = parentContainerId > containerId.id = UUID.random() > LAUNCH_NESTED_CONTAINER_SESSION(containerId, "echo echo") > response = ATTACH_CONTAINER_OUTPUT(containerId) > outputs.append(response.reader) > for output in outputs: > stdout, stderr = getProcessIOData(output) > assert("echo" == stdout + stderr){code} > When we start the very first nested container, `getMountNamespaceTarget()` > returns a PID of the task (`sleep 1000`), because it's the only process whose > `mnt` namespace differs from the parent container. This nested container > becomes a child of PID 1 process, which is also a parent of the command > executor. It's not an executor's child! It can be seen in attached > `pstree.png`. 
> When we start a second nested container, `getMountNamespaceTarget()` might > return the PID of the previous nested container (`echo echo`) instead of the > task's PID (`sleep 1000`). This happens because the first nested container > entered the `mnt` namespace of the task. Then, the containerizer launcher > ("nanny" process) attempts to enter the `mnt` namespace using the PID of a > terminated process, so we get this error. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7076) libprocess tests fail when using libevent 2.1.8
[ https://issues.apache.org/jira/browse/MESOS-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598433#comment-16598433 ] Alexander Rukletsov commented on MESOS-7076: Original libevent-ML thread: http://archives.seul.org/libevent/users/Feb-2018/msg3.html Follow-up from Till: http://archives.seul.org/libevent/users/Aug-2018/msg9.html > libprocess tests fail when using libevent 2.1.8 > --- > > Key: MESOS-7076 > URL: https://issues.apache.org/jira/browse/MESOS-7076 > Project: Mesos > Issue Type: Bug > Components: build, libprocess, test > Environment: macOS 10.12.3, libevent 2.1.8 (installed via Homebrew) >Reporter: Jan Schlicht >Assignee: Till Toenshoff >Priority: Critical > Labels: ci > Attachments: libevent-openssl11.patch > > > Running {{libprocess-tests}} on Mesos compiled with {{--enable-libevent > --enable-ssl}} on an operating system using libevent 2.1.8, SSL related tests > fail like > {noformat} > [ RUN ] SSLTest.SSLSocket > I0207 15:20:46.017881 2528580544 openssl.cpp:419] CA file path is > unspecified! NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE= > I0207 15:20:46.017904 2528580544 openssl.cpp:424] CA directory path > unspecified! NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR= > I0207 15:20:46.017918 2528580544 openssl.cpp:429] Will not verify peer > certificate! > NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification > I0207 15:20:46.017923 2528580544 openssl.cpp:435] Will only verify peer > certificate if presented! > NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate > verification > WARNING: Logging before InitGoogleLogging() is written to STDERR > I0207 15:20:46.033001 2528580544 openssl.cpp:419] CA file path is > unspecified! NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE= > I0207 15:20:46.033179 2528580544 openssl.cpp:424] CA directory path > unspecified! 
NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR= > I0207 15:20:46.033196 2528580544 openssl.cpp:429] Will not verify peer > certificate! > NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification > I0207 15:20:46.033201 2528580544 openssl.cpp:435] Will only verify peer > certificate if presented! > NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate > verification > ../../../3rdparty/libprocess/src/tests/ssl_tests.cpp:257: Failure > Failed to wait 15secs for Socket(socket.get()).recv() > [ FAILED ] SSLTest.SSLSocket (15196 ms) > {noformat} > Tests failing are > {noformat} > SSLTest.SSLSocket > SSLTest.NoVerifyBadCA > SSLTest.VerifyCertificate > SSLTest.ProtocolMismatch > SSLTest.ECDHESupport > SSLTest.PeerAddress > SSLTest.HTTPSGet > SSLTest.HTTPSPost > SSLTest.SilentSocket > SSLTest.ShutdownThenSend > SSLVerifyIPAdd/SSLTest.BasicSameProcess/0, where GetParam() = "false" > SSLVerifyIPAdd/SSLTest.BasicSameProcess/1, where GetParam() = "true" > SSLVerifyIPAdd/SSLTest.BasicSameProcessUnix/0, where GetParam() = "false" > SSLVerifyIPAdd/SSLTest.BasicSameProcessUnix/1, where GetParam() = "true" > SSLVerifyIPAdd/SSLTest.RequireCertificate/0, where GetParam() = "false" > SSLVerifyIPAdd/SSLTest.RequireCertificate/1, where GetParam() = "true" > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.
[ https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596258#comment-16596258 ] Alexander Rukletsov edited comment on MESOS-8545 at 8/29/18 3:01 PM: - When the agent handles {{ATTACH_CONTAINER_INPUT}} call, it creates an HTTP [streaming connection|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3104] to IOSwitchboard. After the agent [sends|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3141] a request to IOSwitchboard, a new instance of {{ConnectionProcess}} is created, which calls [{{ConnectionProcess::read()}}|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1220] to read an HTTP response from IOSwitchboard. If the socket is closed before a `\r\n\r\n` response is received, the {{ConnectionProcess}} calls `[disconnect()|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1326]`, which in turn [flushes `pipeline`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1197-L1201] containing a {{Response}} promise. This leads to responding back (to the {{AttachInputToNestedContainerSession}} [test|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/tests/api_tests.cpp#L7942-L7943]) an {{HTTP 500}} error with body "Disconnected". When io redirect finishes, IOSwitchboardServerProcess calls {{terminate(self(), false)}} (here [\[1\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1262] or there [\[2\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1713]). 
Then, {{IOSwitchboardServerProcess::finalize()}} sets a value on the [`promise`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1304-L1308], which [unblocks|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard_main.cpp#L149-L150] the {{main()}} function. As a result, the IOSwitchboard process terminates immediately. When IOSwitchboard terminates, some response messages may not yet have been [written|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1699] to the socket. So, if any delay occurs before [sending|https://github.com/apache/mesos/blob/95bbe784da51b3a7eaeb9127e2541ea0b2af07b5/3rdparty/libprocess/src/http.cpp#L1742-L1748] the response back to the agent, the socket will be closed due to IOSwitchboard process termination. That leads to the aforementioned premature socket close in the agent. See my previous comment, which includes steps to reproduce the bug. was (Author: abudnik): When the agent handles `ATTACH_CONTAINER_INPUT` call, it creates an HTTP [streaming connection|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3104] to IOSwitchboard. After the agent [sends|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3141] a request to IOSwitchboard, a new instance of `ConnectionProcess` is created, which calls [`ConnectionProcess::read()`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1220] to read an HTTP response from IOSwitchboard.
If the socket is closed before a `\r\n\r\n` response is received, the `ConnectionProcess` calls `[disconnect()|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1326]`, which in turn [flushes `pipeline`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1197-L1201] containing a `Response` promise. This leads to responding back (to the `AttachInputToNestedContainerSession` [test|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/tests/api_tests.cpp#L7942-L7943]) an `HTTP 500` error with body "Disconnected". When io redirect finishes, IOSwitchboardServerProcess calls `terminate(self(), false)` (here [\[1\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1262] or there [\[2\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1713]). Then, `IOSwitchboardServerProcess::finalize()` sets a value to the
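[Editor's note] The race described in this comment can be reproduced in miniature with plain sockets. This is a hedged, self-contained sketch, not Mesos code: the server thread stands in for IOSwitchboard terminating before flushing its response, and the client stands in for the agent's `ConnectionProcess`, which then sees EOF before any `\r\n\r\n` header terminator arrives.

```python
import socket
import threading

def premature_close_demo():
    # Server socket standing in for IOSwitchboard's HTTP server.
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)

    def server():
        conn, _ = srv.accept()
        conn.close()  # terminate before writing any response bytes

    t = threading.Thread(target=server)
    t.start()

    # Client standing in for the agent's ConnectionProcess.
    cli = socket.create_connection(srv.getsockname())
    data = cli.recv(4096)  # EOF: the peer closed with no response
    t.join()
    cli.close()
    srv.close()
    return data

data = premature_close_demo()
```

An empty read here is exactly the condition under which `ConnectionProcess` calls `disconnect()` and flushes the `pipeline`, surfacing an HTTP 500 "Disconnected" to the caller.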
[jira] [Assigned] (MESOS-4233) Logging is too verbose for sysadmins / syslog
[ https://issues.apache.org/jira/browse/MESOS-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-4233: -- Assignee: (was: Kapil Arya) > Logging is too verbose for sysadmins / syslog > - > > Key: MESOS-4233 > URL: https://issues.apache.org/jira/browse/MESOS-4233 > Project: Mesos > Issue Type: Epic >Reporter: Cody Maloney >Priority: Major > Labels: mesosphere > Attachments: giant_port_range_logging > > > Currently mesos logs a lot. When launching a thousand tasks in the space of > 10 seconds it will print tens of thousands of log lines, overwhelming syslog > (there is a max rate at which a process can send stuff over a unix socket) > and not giving useful information to a sysadmin who cares about just the > high-level activity and when something goes wrong. > Note mesos also blocks writing to its log locations, so when writing a lot of > log messages, it can fill up the write buffer in the kernel, and be suspended > until the syslog agent catches up reading from the socket (GLOG does a > blocking fwrite to stderr). GLOG also has a big mutex around logging so only > one thing logs at a time. > While for "internal debugging" it is useful to see things like "message went > from internal component x to internal component y", from a sysadmin > perspective I only care about the high level actions taken (launched task for > framework x), sent offer to framework y, got task failed from host z. Note > those are what I'd expect at the "INFO" level. At the "WARNING" level I'd > expect very little to be logged / almost nothing in normal operation. Just > things like "WARN: Replicated log write took longer than expected". WARN > would also get things like backtraces on crashes and abnormal exits / abort. > When trying to launch 3k+ tasks inside a second, mesos logging currently > overwhelms syslog with 100k+ messages, many of which are thousands of bytes. 
> Sysadmins expect to be able to use syslog to monitor basic events in their > system. This is too much. > We can keep logging the messages to files, but the logging to stderr needs to > be reduced significantly (stderr gets picked up and forwarded to syslog / > central aggregation). > What I would like is if I can set the stderr logging level to be different / > independent from the file logging level (Syslog giving the "sysadmin" > aggregated overview, files useful for debugging in depth what happened in a > cluster). A lot of what mesos currently logs at info is really debugging info > / should show up as debug log level. > Some samples of mesos logging a lot more than a sysadmin would want / expect > are attached, and some are below: > - Every task gets printed multiple times for a basic launch: > {noformat} > Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: > I1215 22:58:29.382644 1315 master.cpp:3248] Launching task > envy.5b19a713-a37f-11e5-8b3e-0251692d6109 of framework > 5178f46d-71d6-422f-922c-5bbe82dff9cc- (marathon) > Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: > I1215 22:58:29.382925 1315 master.hpp:176] Adding task > envy.5b1958f2-a37f-11e5-8b3e-0251692d6109 with resources cpus(*):0.0001; > mem(*):16; ports(*):[14047-14047] > {noformat} > - Every task status update prints many log lines, successful ones are part > of normal operation and maybe should be logged at info / debug levels, but > not to a sysadmin (Just show when things fail, and maybe aggregate counters > to tell of the volume of work) > - No log messages should be really big / more than 1k characters (Would > prevent the giant port list attached, make that easily discoverable / bug > filable / fixable) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
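[Editor's note] The feature requested above — one logger with independent thresholds for a verbose file sink and a quiet stderr/syslog sink — can be illustrated with Python's stdlib `logging` module. This is purely illustrative (Mesos uses glog in C++); the sink streams are in-memory stand-ins.

```python
import io
import logging

def make_logger(file_stream, stderr_stream):
    """One logger, two sinks with independent levels."""
    logger = logging.getLogger("sketch")
    logger.handlers.clear()
    logger.setLevel(logging.DEBUG)

    file_handler = logging.StreamHandler(file_stream)
    file_handler.setLevel(logging.INFO)       # detailed on-disk log

    err_handler = logging.StreamHandler(stderr_stream)
    err_handler.setLevel(logging.WARNING)     # only problems reach syslog

    logger.addHandler(file_handler)
    logger.addHandler(err_handler)
    return logger

file_out, err_out = io.StringIO(), io.StringIO()
log = make_logger(file_out, err_out)
log.info("Launching task envy.5b19a713")                       # file only
log.warning("Replicated log write took longer than expected")  # both sinks
```

With this split, per-task INFO chatter stays in the file log while only WARNING-and-above traffic reaches the stderr stream that syslog aggregates.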
[jira] [Commented] (MESOS-9189) Include 'Connection: close' header in streaming API responses.
[ https://issues.apache.org/jira/browse/MESOS-9189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596183#comment-16596183 ] Alexander Rukletsov commented on MESOS-9189: I'm not sure I understand how the change is supposed to help. {{'Connection: close'}} set by a server is an indicator for the client to close the connection _after_ receiving the complete response. AFAIK, we don't ever complete the streaming response in Mesos and there is no way for Mesos to somehow understand that an end client might not be interested in the stream any more and send an empty chunk. From a middleman's point of view the actual value of the {{'Connection'}} header is only interesting _after_ the response is completed, i.e., an empty chunk has been received, which, IIRC, never happens in our case. Is the hope here that some middlemen peek into the {{'Connection'}} header and based on it decide whether to close the connection themselves when their client disconnects even though the response might not be completed? > Include 'Connection: close' header in streaming API responses. > -- > > Key: MESOS-9189 > URL: https://issues.apache.org/jira/browse/MESOS-9189 > Project: Mesos > Issue Type: Improvement > Components: HTTP API >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > > We've seen some HTTP intermediaries (e.g. ELB) decide to re-use connections > to mesos as an optimization to avoid re-connection overhead. As a result, > when the end-client of the streaming API disconnects from the intermediary, > the intermediary leaves the connection to mesos open in an attempt to re-use > the connection for another request once the response completes. Mesos then > thinks that the subscriber never disconnected and the intermediary happily > continues to read the streaming events even though there's no end-client. 
> To help indicate to intermediaries that the connection SHOULD NOT be re-used, > we can set the 'Connection: close' header for streaming API responses. It may > not be respected (since the language seems to be SHOULD NOT), but some > intermediaries may respect it and close the connection if the end-client > disconnects. > Note that libprocess' http server currently doesn't close the connection > based on a handler setting this header, but it doesn't matter here since the > streaming API responses are infinite. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
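[Editor's note] For concreteness, a chunked streaming response carrying the proposed hint would start like the following. The header set is an illustrative sketch, not the exact bytes Mesos emits.

```python
def streaming_response_headers():
    """Head of a chunked (never-ending) streaming HTTP response that
    advises intermediaries not to re-use the connection."""
    return (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: application/json\r\n"
        "Transfer-Encoding: chunked\r\n"
        "Connection: close\r\n"  # the proposed hint: do not re-use
        "\r\n"
    )
```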
[jira] [Comment Edited] (MESOS-9158) Batch state-related read-only requests in the Master actor.
[ https://issues.apache.org/jira/browse/MESOS-9158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595547#comment-16595547 ] Alexander Rukletsov edited comment on MESOS-9158 at 8/28/18 8:17 PM: - {noformat} commit 4118a482a95793252f4713c5e20ef2c70f2ab07b Author: Benno Evers AuthorDate: Tue Aug 28 21:25:52 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 28 21:25:52 2018 +0200 Added '/state-summary' to the set of batched master endpoints. Review: https://reviews.apache.org/r/68321/ {noformat} {noformat} commit 63e9096b0cd883d9edc8907a577bcba0b150b541 Author: Benno Evers AuthorDate: Tue Aug 28 21:26:03 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 28 21:26:03 2018 +0200 Added '/tasks' to the set of batched master endpoints. Review: https://reviews.apache.org/r/68440/ {noformat} {noformat} commit 33c38c9baa20b42562b519971df508283d988abc Author: Benno Evers AuthorDate: Tue Aug 28 21:26:11 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 28 21:26:11 2018 +0200 Added '/slaves' to the set of batched master endpoints. Review: https://reviews.apache.org/r/68441/ {noformat} {noformat} commit 102dcca4e0116d2ffbdcd78d998e032841ffbabe Author: Benno Evers AuthorDate: Tue Aug 28 21:26:18 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 28 21:26:18 2018 +0200 Added '/frameworks' to the set of batched master endpoints. Review: https://reviews.apache.org/r/68442/ {noformat} {noformat} commit 44e523490b394e6c43bce8b5304996137d176f96 Author: Benno Evers AuthorDate: Tue Aug 28 21:26:25 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 28 21:26:25 2018 +0200 Moved members of `ReadOnlyHandler` into separate file. Moved the member functions of class `ReadOnlyHandler` into the new file `readonly_handler.cpp`. This follows the pattern established by `weights_handler.cpp` and `quota_handler.cpp`. 
As part of this move, it was also necessary to move some JSON serialization that are used from both `master.cpp` and `readonly_handler.cpp` to a new pair of files `json.{cpp,hpp}` that can be used from both places. Review: https://reviews.apache.org/r/68473/ {noformat} {noformat} commit 4930ec2e141920411fb9050500f385f5ef6a78a2 Author: Benno Evers AuthorDate: Tue Aug 28 21:26:36 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 28 21:49:41 2018 +0200 Cleaned up some style issues in `ReadOnlyHandler`. This commit fixes several minor style issues: - Sorted member function declarations of `ReadOnlyHandler` alphabetically. - Added notes to remind readers of the fact that requests to certain endpoints are batched. - Changed captured variable in `/frameworks` endpoint handler. Review: https://reviews.apache.org/r/68537/ {noformat} was (Author: alexr): {noformat} commit 4118a482a95793252f4713c5e20ef2c70f2ab07b Author: Benno Evers AuthorDate: Tue Aug 28 21:25:52 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 28 21:25:52 2018 +0200 Added '/state-summary' to the set of batched master endpoints. Review: https://reviews.apache.org/r/68321/ {noformat} {noformat} commit 63e9096b0cd883d9edc8907a577bcba0b150b541 Author: Benno Evers AuthorDate: Tue Aug 28 21:26:03 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 28 21:26:03 2018 +0200 Added '/tasks' to the set of batched master endpoints. Review: https://reviews.apache.org/r/68440/ {noformat} {noformat} commit 33c38c9baa20b42562b519971df508283d988abc Author: Benno Evers AuthorDate: Tue Aug 28 21:26:11 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 28 21:26:11 2018 +0200 Added '/slaves' to the set of batched master endpoints. 
Review: https://reviews.apache.org/r/68441/ {noformat} {noformat} commit 102dcca4e0116d2ffbdcd78d998e032841ffbabe Author: Benno Evers AuthorDate: Tue Aug 28 21:26:18 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 28 21:26:18 2018 +0200 Added '/frameworks' to the set of batched master endpoints. Review: https://reviews.apache.org/r/68442/ {noformat} {noformat} commit 44e523490b394e6c43bce8b5304996137d176f96 Author: Benno Evers AuthorDate: Tue Aug 28 21:26:25 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 28 21:26:25 2018 +0200 Moved members of `ReadOnlyHandler` into separate file. Moved the member functions of class `ReadOnlyHandler` into the new file `readonly_handler.cpp`. This follows the pattern established by `weights_handler.cpp` and `quota_handler.cpp`. As part of this move, it was also necessary to move some JSON serialization that are used from both `master.cpp` and `readonly_handler.cpp` to a new pair of
[jira] [Commented] (MESOS-9158) Batch state-related read-only requests in the Master actor.
[ https://issues.apache.org/jira/browse/MESOS-9158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595547#comment-16595547 ] Alexander Rukletsov commented on MESOS-9158: {noformat} commit 4118a482a95793252f4713c5e20ef2c70f2ab07b Author: Benno Evers AuthorDate: Tue Aug 28 21:25:52 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 28 21:25:52 2018 +0200 Added '/state-summary' to the set of batched master endpoints. Review: https://reviews.apache.org/r/68321/ {noformat} {noformat} commit 63e9096b0cd883d9edc8907a577bcba0b150b541 Author: Benno Evers AuthorDate: Tue Aug 28 21:26:03 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 28 21:26:03 2018 +0200 Added '/tasks' to the set of batched master endpoints. Review: https://reviews.apache.org/r/68440/ {noformat} {noformat} commit 33c38c9baa20b42562b519971df508283d988abc Author: Benno Evers AuthorDate: Tue Aug 28 21:26:11 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 28 21:26:11 2018 +0200 Added '/slaves' to the set of batched master endpoints. Review: https://reviews.apache.org/r/68441/ {noformat} {noformat} commit 102dcca4e0116d2ffbdcd78d998e032841ffbabe Author: Benno Evers AuthorDate: Tue Aug 28 21:26:18 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 28 21:26:18 2018 +0200 Added '/frameworks' to the set of batched master endpoints. Review: https://reviews.apache.org/r/68442/ {noformat} {noformat} commit 44e523490b394e6c43bce8b5304996137d176f96 Author: Benno Evers AuthorDate: Tue Aug 28 21:26:25 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 28 21:26:25 2018 +0200 Moved members of `ReadOnlyHandler` into separate file. Moved the member functions of class `ReadOnlyHandler` into the new file `readonly_handler.cpp`. This follows the pattern established by `weights_handler.cpp` and `quota_handler.cpp`. 
As part of this move, it was also necessary to move some JSON serialization that are used from both `master.cpp` and `readonly_handler.cpp` to a new pair of files `json.{cpp,hpp}` that can be used from both places. Review: https://reviews.apache.org/r/68473/ {noformat} {noformat} commit 4930ec2e141920411fb9050500f385f5ef6a78a2 Author: Benno Evers AuthorDate: Tue Aug 28 21:26:36 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 28 21:49:41 2018 +0200 Cleaned up some style issues in `ReadOnlyHandler`. This commit fixes several minor style issues: - Sorted member function declarations of `ReadOnlyHandler` alphabetically. - Added notes to remind readers of the fact that requests to certain endpoints are batched. - Changed captured variable in `/frameworks` endpoint handler. Review: https://reviews.apache.org/r/68537/ {noformat} > Batch state-related read-only requests in the Master actor. > --- > > Key: MESOS-9158 > URL: https://issues.apache.org/jira/browse/MESOS-9158 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Alexander Rukletsov >Assignee: Benno Evers >Priority: Major > Labels: mesosphere, performance > > Similar to MESOS-9122, make all read-only master state endpoints batched. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
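[Editor's note] The batching idea behind these patches can be sketched as follows. Names and shapes are invented for illustration (the real `ReadOnlyHandler` is a C++ class inside the master actor): while the actor is busy, identical read-only requests queue up; the actor then renders the state once and answers every queued request with that single result.

```python
class ReadOnlyHandler:
    def __init__(self, render_state):
        self.render_state = render_state  # expensive serialization step
        self.pending = []                 # requests queued while busy

    def enqueue(self, request):
        self.pending.append(request)

    def process_batch(self):
        if not self.pending:
            return []
        state = self.render_state()       # computed once per batch
        responses = [(req, state) for req in self.pending]
        self.pending.clear()
        return responses

calls = []
handler = ReadOnlyHandler(lambda: calls.append(1) or '{"state": "..."}')
for i in range(3):
    handler.enqueue(f"GET /state #{i}")
replies = handler.process_batch()
```

Three queued requests cost one state rendering instead of three, which is the responsiveness win the endpoint batching aims for.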
[jira] [Comment Edited] (MESOS-9185) An attempt to remove or destroy container in composing containerizer leads to segfault.
[ https://issues.apache.org/jira/browse/MESOS-9185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595084#comment-16595084 ] Alexander Rukletsov edited comment on MESOS-9185 at 8/28/18 4:11 PM: - *1.8.0-dev:* {noformat} commit 8496b369d52d27e90da88787242fd6f9d9abb78e Author: Andrei Budnik AuthorDate: Tue Aug 28 16:46:54 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 28 16:46:54 2018 +0200 Added `AgentAPITest.LaunchNestedContainerWithUnknownParent` test. This test verifies that launch nested container fails when the parent container is unknown to the containerizer. Review: https://reviews.apache.org/r/68234/ {noformat} {noformat} commit 5fbfb8da5ad62c40752fa7b7e0a0842c892f6857 Author: Andrei Budnik AuthorDate: Tue Aug 28 16:47:04 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 28 16:47:04 2018 +0200 Cleaned up container on launch failures in composing containerizer. Previously, if a parent container was unknown to the composing containerizer during an attempt to launch a nested container via `ComposingContainerizerProcess::launch()`, the composing containerizer returned an error without cleaning up the container. The `containerizer` field was uninitialized, so a further attempt to remove or destroy the nested container led to segfault. This patch removes the container when the parent container is unknown. Review: https://reviews.apache.org/r/68235/ {noformat} *backport to 1.7.1:* {noformat} commit 1660a0552e58ba4407180508f7e4eeed2050b2a2 Author: Andrei Budnik AuthorDate: Tue Aug 28 16:47:04 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 28 18:07:44 2018 +0200 Cleaned up container on launch failures in composing containerizer. Previously, if a parent container was unknown to the composing containerizer during an attempt to launch a nested container via `ComposingContainerizerProcess::launch()`, the composing containerizer returned an error without cleaning up the container. 
The `containerizer` field was uninitialized, so a further attempt to remove or destroy the nested container led to segfault. This patch removes the container when the parent container is unknown. Review: https://reviews.apache.org/r/68235/ (cherry picked from commit 5fbfb8da5ad62c40752fa7b7e0a0842c892f6857) {noformat} was (Author: alexr): {noformat} commit 8496b369d52d27e90da88787242fd6f9d9abb78e Author: Andrei Budnik AuthorDate: Tue Aug 28 16:46:54 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 28 16:46:54 2018 +0200 Added `AgentAPITest.LaunchNestedContainerWithUnknownParent` test. This test verifies that launch nested container fails when the parent container is unknown to the containerizer. Review: https://reviews.apache.org/r/68234/ {noformat} {noformat} commit 5fbfb8da5ad62c40752fa7b7e0a0842c892f6857 Author: Andrei Budnik AuthorDate: Tue Aug 28 16:47:04 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 28 16:47:04 2018 +0200 Cleaned up container on launch failures in composing containerizer. Previously, if a parent container was unknown to the composing containerizer during an attempt to launch a nested container via `ComposingContainerizerProcess::launch()`, the composing containerizer returned an error without cleaning up the container. The `containerizer` field was uninitialized, so a further attempt to remove or destroy the nested container led to segfault. This patch removes the container when the parent container is unknown. Review: https://reviews.apache.org/r/68235/ {noformat} > An attempt to remove or destroy container in composing containerizer leads to > segfault. 
> --- > > Key: MESOS-9185 > URL: https://issues.apache.org/jira/browse/MESOS-9185 > Project: Mesos > Issue Type: Bug > Components: agent, containerization >Affects Versions: 1.7.0 >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Major > Labels: mesosphere > Fix For: 1.8.0 > > > `LAUNCH_NESTED_CONTAINER` and `LAUNCH_NESTED_CONTAINER_SESSION` lead to > segfault in the agent when the parent container is unknown to the composing > containerizer. If the parent container cannot be found during an attempt to > launch a nested container via `ComposingContainerizerProcess::launch()`, the > composing containerizer returns an error without cleaning up the container. On > `launch()` failures, the agent calls `destroy()`, which accesses the uninitialized > `containerizer` field. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
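The failure mode above can be illustrated with a minimal standalone sketch. This is not the actual Mesos code: the `Container`/`launchNested`/`destroy` names and the flat `std::map` are hypothetical simplifications of `ComposingContainerizerProcess`, used only to show why erasing the partially-created entry on a failed launch prevents the later null/uninitialized-pointer access.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified stand-ins for the real Mesos types.
struct Containerizer {
  bool destroy(const std::string& /*id*/) { return true; }
};

struct Container {
  Containerizer* containerizer = nullptr;  // Only set once launch succeeds.
};

std::map<std::string, Container> containers_;

// Sketch of the fixed launch path: when the parent container is unknown,
// erase the partially-created entry before returning the error, so a later
// destroy() cannot reach the uninitialized `containerizer` pointer.
bool launchNested(const std::string& parentId, const std::string& childId) {
  containers_[childId] = Container();  // Entry created before validation.

  if (containers_.count(parentId) == 0) {
    containers_.erase(childId);  // The fix: clean up on launch failure.
    return false;
  }

  // ... a real implementation would pick the parent's containerizer
  // and delegate the launch to it here ...
  return true;
}

bool destroy(const std::string& id) {
  auto it = containers_.find(id);
  if (it == containers_.end()) {
    return false;  // Unknown container: nothing to do, no crash.
  }
  return it->second.containerizer->destroy(id);
}
```

Without the `erase` call, the failed launch would leave an entry whose `containerizer` pointer is never initialized, and the agent's follow-up `destroy()` would dereference it.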
[jira] [Assigned] (MESOS-8345) Improve master responsiveness while serving state information.
[ https://issues.apache.org/jira/browse/MESOS-8345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-8345: -- Assignee: Alexander Rukletsov > Improve master responsiveness while serving state information. > -- > > Key: MESOS-8345 > URL: https://issues.apache.org/jira/browse/MESOS-8345 > Project: Mesos > Issue Type: Epic > Components: HTTP API, master >Reporter: Benjamin Mahler >Assignee: Alexander Rukletsov >Priority: Major > Labels: mesosphere, performance > > Currently when state is requested from the master, the response is built > using the master actor. This means that when the master is building an > expensive state response, the master is locked and cannot process other > events. This in turn can lead to higher latency on further requests to state. > Previous performance improvements to JSON generation (MESOS-4235) alleviated > this issue, but for large clusters with a lot of clients this can still be a > problem. > It's possible to serve state outside of the master actor by streaming the > state (re-using the existing streaming operator API) into another actor(s) > and serving from there. > NOTE: I believe this approach will incur a small performance cost during > master failover, since the master has to perform an additional copy of state > that it fans out. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
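The "serve state outside the master actor" idea can be sketched as follows. This is an assumption-laden simplification, not the Mesos design: `StateHolder` is a hypothetical name, and a mutex-guarded snapshot stands in for the libprocess actor that would receive a stream of updates. The point is only the shape of the trade-off the epic describes: the master pays for an extra copy on every change, and expensive reads no longer block its event loop.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical sketch: the "master" pushes state changes into a separate
// holder, and request handlers read from that copy, so building an
// expensive response never locks the master. In real Mesos this would be
// an actor fed by the streaming operator API, not a mutex-guarded struct.
class StateHolder {
public:
  // Called by the master whenever its state changes (the fan-out copy;
  // this is the extra cost the NOTE in the ticket refers to).
  void update(const std::string& snapshot) {
    std::lock_guard<std::mutex> lock(mutex_);
    snapshot_ = snapshot;
  }

  // Called from request-serving threads; never touches the master actor.
  std::string serve() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return snapshot_;
  }

private:
  mutable std::mutex mutex_;
  std::string snapshot_;
};
```

A handler for `/state` would then call `serve()` on the holder, and however slow the serialization of the result is, the master actor keeps processing other events.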
[jira] [Created] (MESOS-9177) Mesos master segfaults when responding to /state requests.
Alexander Rukletsov created MESOS-9177: -- Summary: Mesos master segfaults when responding to /state requests. Key: MESOS-9177 URL: https://issues.apache.org/jira/browse/MESOS-9177 Project: Mesos Issue Type: Bug Components: master Affects Versions: 1.7.0 Reporter: Alexander Rukletsov {noformat} *** SIGSEGV (@0x8) received by PID 66991 (TID 0x7f36792b7700) from PID 8; stack trace: *** @ 0x7f367e7226d0 (unknown) @ 0x7f3681266913 _ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_ @ 0x7f3681266af0 _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZNK5mesos8internal6master19FullFrameworkWriterclEPNSA_12ObjectWriterEEUlPNSA_11ArrayWriterEE1_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ @ 0x7f36812882d0 mesos::internal::master::FullFrameworkWriter::operator()() @ 0x7f36812889d0 _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIN5mesos8internal6master19FullFrameworkWriterEvEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ @ 0x7f368121aef0 _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_ENKUlPNSA_12ObjectWriterEE_clESU_EUlPNSA_11ArrayWriterEE3_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ @ 0x7f3681241be3 _ZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNS4_5OwnedINS_15ObjectApprovers_clES8_SD_ENKUlPN4JSON12ObjectWriterEE_clESH_ @ 0x7f3681242760 
_ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_EUlPNSA_12ObjectWriterEE_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ @ 0x7f36810a41bb _ZNO4JSON5ProxycvSsEv @ 0x7f368215f60e process::http::OK::OK() @ 0x7f3681219061 _ZN7process20AsyncExecutorProcess7executeIZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS_4http7RequestERKNS_5OwnedINS2_15ObjectApprovers_S8_SD_Li0EEENSt9result_ofIFT_T0_T1_EE4typeERKSI_SJ_SK_ @ 0x7f36812212c0 _ZZN7process8dispatchINS_4http8ResponseENS_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS1_7RequestERKNS_5OwnedINS4_15ObjectApprovers_S9_SE_SJ_RS9_RSE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSQ_FSN_T1_T2_T3_EOT4_OT5_OT6_ENKUlSt10unique_ptrINS_7PromiseIS2_EESt14default_deleteIS17_EEOSH_OS9_OSE_PNS_11ProcessBaseEE_clES1A_S1B_S1C_S1D_S1F_ @ 0x7f36812215ac _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchINS1_4http8ResponseENS1_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNSA_7RequestERKNS1_5OwnedINSD_15ObjectApprovers_SI_SN_SS_RSI_RSN_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSZ_FSW_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS1_7PromiseISB_EESt14default_deleteIS1G_EEOSQ_OSI_OSN_S3_E_IS1J_SQ_SI_SN_St12_PlaceholderILi1EEclEOS3_ @ 0x7f36821f3541 process::ProcessBase::consume() @ 0x7f3682209fbc process::ProcessManager::resume() @ 0x7f368220fa76 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv @ 0x7f367eefc2b0 (unknown) @ 0x7f367e71ae25 start_thread @ 0x7f367e444bad __clone {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9176) Mesos does not work properly on modern Ubuntu distributions.
Alexander Rukletsov created MESOS-9176: -- Summary: Mesos does not work properly on modern Ubuntu distributions. Key: MESOS-9176 URL: https://issues.apache.org/jira/browse/MESOS-9176 Project: Mesos Issue Type: Epic Affects Versions: 1.7.0 Environment: Ubuntu 17.10 Ubuntu 18.04 Reporter: Alexander Rukletsov We have observed several issues in various components on modern Ubuntus, e.g., 17.10, 18.04. Needless to say, we need to ensure Mesos compiles and runs fine on those distros. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-9000) Operator API event stream can miss task status updates.
[ https://issues.apache.org/jira/browse/MESOS-9000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587188#comment-16587188 ] Alexander Rukletsov edited comment on MESOS-9000 at 8/21/18 9:12 AM: - On the 1.8.0-dev: {noformat} commit 613741147123563f7b68e900c321e7f5db8236fe Author: Benno Evers AuthorDate: Tue Aug 21 10:58:35 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 21 11:04:37 2018 +0200 Changed operator API to notify subscribers on every status change. Prior to this change, the master would only send `TaskUpdated` messages to subscribers when the latest known task state on the agent changed. This implied that schedulers could not reliably wait for the status information corresponding to specific state updates, i.e., `TASK_RUNNING`, since there is no guarantee that subscribers get notified during the time when this status update will be included in the status field. After this change, `TaskUpdated` messages are sent whenever the latest acknowledged state of the task changes. Review: https://reviews.apache.org/r/67575/ {noformat} > Operator API event stream can miss task status updates. > --- > > Key: MESOS-9000 > URL: https://issues.apache.org/jira/browse/MESOS-9000 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Reporter: Benno Evers >Assignee: Benno Evers >Priority: Major > Labels: mesosphere > Fix For: 1.7.0 > > > As of now, the master only sends TaskUpdated messages to subscribers when the > latest known task state on the agent changed: > {noformat} > // src/master/master.cpp > if (!protobuf::isTerminalState(task->state())) { > if (status.state() != task->state()) { > sendSubscribersUpdate = true; > } > task->set_state(latestState.getOrElse(status.state())); > } > {noformat} > The latest state is set like this: > {noformat} > // src/messages/messages.proto > message StatusUpdate { > [...] > // This corresponds to the latest state of the task according to the > // agent. Note that this state might be different than the state in > // 'status' because task status update manager queues updates. In > // other words, 'status' corresponds to the update at top of the > // queue and 'latest_state' corresponds to the update at bottom of > // the queue. > optional TaskState latest_state = 7; > } > {noformat} > However, the `TaskStatus` message included in a `TaskUpdated` event is the > event at the bottom of the queue when the update was sent. > So we can easily get in a situation where e.g. the first TaskUpdated has > .status.state == TASK_STARTING and .state == TASK_RUNNING, and the second > update with .status.state == TASK_RUNNING and .state == TASK_RUNNING would > not get delivered because the latest known state did not change. > This implies that schedulers cannot reliably wait for the status information > corresponding to a specific task state, since there is no guarantee that > subscribers get notified during the time when this status update will be > included in the status field.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
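The behavior the fix introduces can be modeled with a small standalone sketch. This is hypothetical code, not the patch from r/67575: `Update` and `notifySubscribers` are invented names, and the two string fields mirror `status.state` and `latest_state` from the description. It shows the key change: deciding whether to notify based on the acknowledged status state rather than the agent's latest known state.

```cpp
#include <string>
#include <vector>

// Simplified model of one status update: the state at the top of the
// status-update queue and the agent's latest known state.
struct Update {
  std::string statusState;  // `status.state` in the ticket.
  std::string latestState;  // `latest_state` in the ticket.
};

// New behavior (sketched): notify whenever the acknowledged state changes,
// so TASK_STARTING and TASK_RUNNING each produce an event. The old check
// compared `latestState` against the stored task state instead, which let
// the second update in the ticket's example be silently skipped.
std::vector<std::string> notifySubscribers(const std::vector<Update>& updates) {
  std::vector<std::string> sent;
  std::string acknowledged;
  for (const Update& u : updates) {
    if (u.statusState != acknowledged) {
      sent.push_back(u.statusState);  // Would emit a TaskUpdated event.
      acknowledged = u.statusState;
    }
  }
  return sent;
}
```

With the sequence from the description ({TASK_STARTING, TASK_RUNNING} followed by {TASK_RUNNING, TASK_RUNNING}), comparing `latestState` yields a single event, while the acknowledged-state comparison yields one event per state transition.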
[jira] [Commented] (MESOS-9000) Operator API event stream can miss task status updates.
[ https://issues.apache.org/jira/browse/MESOS-9000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587193#comment-16587193 ] Alexander Rukletsov commented on MESOS-9000: On the 1.7.x branch: {noformat} commit a2f826d5a641b8ae3e5742ffeab7166281e296f8 Author: Benno Evers AuthorDate: Tue Aug 21 10:58:35 2018 +0200 Commit: Alexander Rukletsov CommitDate: Tue Aug 21 11:08:41 2018 +0200 Changed operator API to notify subscribers on every status change. Prior to this change, the master would only send `TaskUpdated` messages to subscribers when the latest known task state on the agent changed. This implied that schedulers could not reliably wait for the status information corresponding to specific state updates, i.e., `TASK_RUNNING`, since there is no guarantee that subscribers get notified during the time when this status update will be included in the status field. After this change, `TaskUpdated` messages are sent whenever the latest acknowledged state of the task changes. Review: https://reviews.apache.org/r/67575/ {noformat} > Operator API event stream can miss task status updates. > --- > > Key: MESOS-9000 > URL: https://issues.apache.org/jira/browse/MESOS-9000 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Reporter: Benno Evers >Assignee: Benno Evers >Priority: Major > Labels: mesosphere > Fix For: 1.7.0 > > > As of now, the master only sends TaskUpdated messages to subscribers when the > latest known task state on the agent changed: > {noformat} > // src/master/master.cpp > if (!protobuf::isTerminalState(task->state())) { > if (status.state() != task->state()) { > sendSubscribersUpdate = true; > } > task->set_state(latestState.getOrElse(status.state())); > } > {noformat} > The latest state is set like this: > {noformat} > // src/messages/messages.proto > message StatusUpdate { > [...] > // This corresponds to the latest state of the task according to the > // agent. 
Note that this state might be different than the state in > // 'status' because task status update manager queues updates. In > // other words, 'status' corresponds to the update at top of the > // queue and 'latest_state' corresponds to the update at bottom of > // the queue. > optional TaskState latest_state = 7; > } > {noformat} > However, the `TaskStatus` message included in a `TaskUpdated` event is the > event at the bottom of the queue when the update was sent. > So we can easily get in a situation where e.g. the first TaskUpdated has > .status.state == TASK_STARTING and .state == TASK_RUNNING, and the second > update with .status.state == TASK_RUNNING and .state == TASK_RUNNING would > not get delivered because the latest known state did not change. > This implies that schedulers cannot reliably wait for the status information > corresponding to a specific task state, since there is no guarantee that > subscribers get notified during the time when this status update will be > included in the status field. -- This message was sent by Atlassian JIRA (v7.6.3#76005)