[jira] [Commented] (MESOS-9766) /__processes__ endpoint can hang.

2019-05-23 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846645#comment-16846645
 ] 

Alexander Rukletsov commented on MESOS-9766:


{noformat:title=1.9.0 only}
commit a8c411d3f8d2895ff5e95c412ef2f3e94713520f
Author: Alexander Rukletsov 
AuthorDate: Fri May 3 13:23:50 2019 +0200
Commit: Alexander Rukletsov 
CommitDate: Thu May 23 12:58:32 2019 +0200

Logged when `/__processes__` returns.

Adds a log entry when a response generated by `/__processes__`
is about to be returned to the client.

Review: https://reviews.apache.org/r/70589
{noformat}

> /__processes__ endpoint can hang.
> -
>
> Key: MESOS-9766
> URL: https://issues.apache.org/jira/browse/MESOS-9766
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: foundations
> Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.1, 1.9.0
>
>
> A user reported that the {{/\_\_processes\_\_}} endpoint occasionally hangs.
> Stack traces provided by [~alexr] revealed that all the threads appeared to 
> be idle waiting for events. After investigating the code, the issue was found 
> to be possible when a process gets terminated after the 
> {{/\_\_processes\_\_}} route handler dispatches to it, thus dropping the 
> dispatch and abandoning the future.
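The failure mode in the description can be illustrated outside of libprocess. In this hedged Python asyncio sketch (names and the timeout are illustrative, not Mesos code), a future abandoned because its producer terminated would hang an awaiting handler forever unless a timeout bounds the wait:

```python
import asyncio

async def route_handler():
    # Dispatch to a process that terminates before replying: the
    # returned future is never completed, mirroring the dropped
    # dispatch described in this issue (names are illustrative).
    abandoned = asyncio.get_running_loop().create_future()
    # Without a timeout, `await abandoned` would hang forever.
    try:
        return await asyncio.wait_for(abandoned, timeout=0.1)
    except asyncio.TimeoutError:
        return "503 Service Unavailable"

print(asyncio.run(route_handler()))
```

The actual fix in libprocess is different in mechanism, but the sketch shows why a dropped dispatch translates into an endpoint that never responds.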



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9791) Libprocess does not support server only SSL certificate verification.

2019-05-21 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844701#comment-16844701
 ] 

Alexander Rukletsov commented on MESOS-9791:


A prototype relaxing certificate verification: 
https://github.com/rukletsov/mesos/commits/alexr/ssl-server-cert

> Libprocess does not support server only SSL certificate verification.
> -
>
> Key: MESOS-9791
> URL: https://issues.apache.org/jira/browse/MESOS-9791
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Alexander Rukletsov
>Priority: Major
>  Labels: foundations, mesosphere, security, ssl, tls
>
> Currently SSL certificate verification in Libprocess can be configured in the 
> [following 
> ways|https://github.com/apache/mesos/blob/eecb82c77117998af0c67a53c64e9b1e975acfa4/3rdparty/libprocess/src/openssl.cpp#L88-L97]:
> (1) send certificate if in server mode, verify peer certificates *if present*;
> (2) require valid peer certificates in *both* client and server modes.
> It is currently impossible to configure a Libprocess instance to 
> simultaneously:
> (3) require valid peer certificate in client mode and send certificate in 
> server mode.
> Because Libprocess is often used by programs that act both as servers and 
> clients, implementing (3) is necessary to enable the so-called 
> webserver-browser model.





[jira] [Created] (MESOS-9791) Libprocess does not support server only SSL certificate verification.

2019-05-21 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-9791:
--

 Summary: Libprocess does not support server only SSL certificate 
verification.
 Key: MESOS-9791
 URL: https://issues.apache.org/jira/browse/MESOS-9791
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Alexander Rukletsov


Currently SSL certificate verification in Libprocess can be configured in the 
[following 
ways|https://github.com/apache/mesos/blob/eecb82c77117998af0c67a53c64e9b1e975acfa4/3rdparty/libprocess/src/openssl.cpp#L88-L97]:
(1) send certificate if in server mode, verify peer certificates *if present*;
(2) require valid peer certificates in *both* client and server modes.

It is currently impossible to configure a Libprocess instance to simultaneously:
(3) require valid peer certificate in client mode and send certificate in 
server mode.

Because Libprocess is often used by programs that act both as servers and 
clients, implementing (3) is necessary to enable the so-called 
webserver-browser model.
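As a rough analogy (using Python's standard {{ssl}} module rather than the libprocess SSL flags), mode (3) corresponds to a strict client context paired with a server context that presents a certificate but does not demand one from clients:

```python
import ssl

# Client role: require and verify the peer (server) certificate.
client_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
client_ctx.check_hostname = True
client_ctx.verify_mode = ssl.CERT_REQUIRED

# Server role: present our own certificate, but accept anonymous
# clients -- the webserver-browser model described above.
server_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
server_ctx.verify_mode = ssl.CERT_NONE
# server_ctx.load_cert_chain("server.pem")  # hypothetical cert path
```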





[jira] [Created] (MESOS-9790) Libprocess does not use standard tooling for hostname validation.

2019-05-21 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-9790:
--

 Summary: Libprocess does not use standard tooling for hostname 
validation. 
 Key: MESOS-9790
 URL: https://issues.apache.org/jira/browse/MESOS-9790
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Alexander Rukletsov


Libprocess currently uses [custom 
code|https://github.com/apache/mesos/blob/eecb82c77117998af0c67a53c64e9b1e975acfa4/3rdparty/libprocess/src/openssl.cpp#L755-L863]
 for hostname validation in its SSL certificate verification workflow. However 
openssl provides a function for this, [{{X509_check_host()}} 
|https://www.openssl.org/docs/manmaster/man3/X509_check_host.html].

For safety and reliability, we should enable an option to use 
{{X509_check_host()}} for hostname validation instead of our custom code, but 
preserve the custom code for backward compatibility.
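For comparison, CPython's standard library took the same route: it retired its Python-level hostname matching in favor of letting OpenSSL perform the check (a hedged sketch; the delegation detail describes CPython 3.7+, not anything in libprocess):

```python
import ssl

# create_default_context() enables hostname checking; in CPython 3.7+
# the actual matching is delegated to OpenSSL's X509_VERIFY_PARAM
# machinery (the same code path X509_check_host() uses) rather than
# custom wildcard-matching code.
ctx = ssl.create_default_context()
assert ctx.check_hostname is True
assert ctx.verify_mode == ssl.CERT_REQUIRED
```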





[jira] [Commented] (MESOS-9329) CMake build on Fedora 28 fails due to libevent error

2019-05-14 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839483#comment-16839483
 ] 

Alexander Rukletsov commented on MESOS-9329:


Indeed, the autotools build uses a newer version of libevent, 
[2.0.22|https://github.com/apache/mesos/blob/a9a2acabd03181865055b77cf81e7bb310b236d6/3rdparty/libevent-2.0.22-stable.tar.gz].
 We can't easily use it in the cmake build because newer versions do not 
support cmake, see MESOS-3529. Bottom line is: a cmake build on Linux with ssl 
and libevent enabled is currently not supported.

> CMake build on Fedora 28 fails due to libevent error
> 
>
> Key: MESOS-9329
> URL: https://issues.apache.org/jira/browse/MESOS-9329
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>
> Trying to build Mesos using cmake with the options 
> {noformat}
> cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_SSL=1 -DENABLE_LIBEVENT=1
> {noformat}
> fails due to the following:
> {noformat}
> [  1%] Building C object CMakeFiles/event_extra.dir/bufferevent_openssl.c.o
> /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:
>  In function ‘bio_bufferevent_new’:
> /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:112:3:
>  error: dereferencing pointer to incomplete type ‘BIO’ {aka ‘struct bio_st’}
>   b->init = 0;
>^~
> /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:
>  At top level:
> /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:234:1:
>  error: variable ‘methods_bufferevent’ has initializer but incomplete type
>  static BIO_METHOD methods_bufferevent = {
> [...]
> {noformat}
> Since the autotools build does not have issues when enabling libevent and 
> ssl, it seems most likely that the `libevent-2.1.5-beta` version used by 
> default in the cmake build is somehow connected to the error message.





[jira] [Commented] (MESOS-9766) /__processes__ endpoint can hang.

2019-05-06 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833733#comment-16833733
 ] 

Alexander Rukletsov commented on MESOS-9766:


Logging processing time: https://reviews.apache.org/r/70589/

> /__processes__ endpoint can hang.
> -
>
> Key: MESOS-9766
> URL: https://issues.apache.org/jira/browse/MESOS-9766
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: foundations
>
> A user reported that the {{/\_\_processes\_\_}} endpoint occasionally hangs.
> Stack traces provided by [~alexr] revealed that all the threads appeared to 
> be idle waiting for events. After investigating the code, the issue was found 
> to be possible when a process gets terminated after the 
> {{/\_\_processes\_\_}} route handler dispatches to it, thus dropping the 
> dispatch and abandoning the future.





[jira] [Commented] (MESOS-9718) Compile failures with char8_t by MSVC under /std:c++latest(C++20) mode

2019-04-30 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830104#comment-16830104
 ] 

Alexander Rukletsov commented on MESOS-9718:


[~QuellaZhang], [~abudnik], the proposed patch essentially reverts 
https://reviews.apache.org/r/58430/. I understand that the patch compiles with 
the newest version of the MSVC toolset, but does it compile with the older 
versions that are currently in use? To phrase it differently: why do the 
reasons for introducing https://reviews.apache.org/r/58430/ no longer apply?

> Compile failures with char8_t by MSVC under /std:c++latest(C++20) mode
> --
>
> Key: MESOS-9718
> URL: https://issues.apache.org/jira/browse/MESOS-9718
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: QuellaZhang
>Priority: Major
>  Labels: windows
> Attachments: mesos.patch.txt
>
>
> Hi All,
> We've stumbled across some build failures in Mesos after implementing support 
> for char8_t under {{/std:c++latest}} in the development version of Visual C++. 
> Could you help look at this? Thanks in advance! Note that this issue is only 
> found when compiling with the unreleased VC toolset; the next release of MSVC 
> will have this behavior.
> *Repro steps:*
>  git clone -c core.autocrlf=true [https://github.com/apache/mesos] 
> D:\mesos\src
>  open a VS 2017 x64 command prompt as admin and browse to D:\mesos
>  set _CL_=/std:c++latest
>  cd src
>  .\bootstrap.bat
>  cd ..
>  mkdir build_x64 && pushd build_x64
>  cmake ..\src -G "Visual Studio 15 2017 Win64" 
> -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64
> *Failures:*
>  base64_tests.i
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2664: 
> 'std::string base64::encode_url_safe(const std::string &,bool)': cannot 
> convert argument 1 from 'const char8_t [12]' to 'const std::string &'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: Reason: cannot 
> convert from 'const char8_t [12]' to 'const std::string'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: No constructor 
> could take the source type, or constructor overload resolution was ambiguous
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2660: 
> 'testing::internal::EqHelper::Compare': function does not take 3 
> arguments
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430):
>  note: see declaration of 'testing::internal::EqHelper::Compare'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2512: 
> 'testing::AssertionResult': no appropriate default constructor available
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256):
>  note: see declaration of 'testing::AssertionResult'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2664: 
> 'std::string base64::encode_url_safe(const std::string &,bool)': cannot 
> convert argument 1 from 'const char8_t [12]' to 'const std::string &'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: Reason: cannot 
> convert from 'const char8_t [12]' to 'const std::string'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: No constructor 
> could take the source type, or constructor overload resolution was ambiguous
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2660: 
> 'testing::internal::EqHelper::Compare': function does not take 3 
> arguments
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430):
>  note: see declaration of 'testing::internal::EqHelper::Compare'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2512: 
> 'testing::AssertionResult': no appropriate default constructor available
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256):
>  note: see declaration of 'testing::AssertionResult'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2664: 
> 'Try base64::decode_url_safe(const std::string &)': cannot 
> convert argument 1 from 'const char8_t [16]' to 'const std::string &'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): note: Reason: cannot 
> convert from 'const char8_t [16]' to 'const std::string'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): note: No constructor 
> could take the source type, or constructor overload resolution was ambiguous
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2672: 
> 'AssertSomeEq': no matching overloaded function found
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2780: 
> 

[jira] [Commented] (MESOS-7935) CMake build should fail immediately for in-source builds

2019-03-08 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16787804#comment-16787804
 ] 

Alexander Rukletsov commented on MESOS-7935:


[~csnate] — could you please upload the diff?

> CMake build should fail immediately for in-source builds
> 
>
> Key: MESOS-7935
> URL: https://issues.apache.org/jira/browse/MESOS-7935
> Project: Mesos
>  Issue Type: Improvement
>  Components: cmake
> Environment: macOS 10.12
> GNU/Linux Debian Stretch
>Reporter: Damien Gerard
>Assignee: Nathan Jackson
>Priority: Major
>  Labels: build
>
> In-source builds are neither recommended nor supported.  It is simple enough 
> to add a check that fails the build immediately.
> ---
> In-source build of master branch was broken with:
> {noformat}
> cd /Users/damien.gerard/projects/acp/mesos/src && 
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
>   -DBUILD_FLAGS=\"\" -DBUILD_JAVA_JVM_LIBRARY=\"\" -DHAS_AUTHENTICATION=1 
> -DLIBDIR=\"/usr/local/libmesos\" -DPICOJSON_USE_INT64 
> -DPKGDATADIR=\"/usr/local/share/mesos\" 
> -DPKGLIBEXECDIR=\"/usr/local/libexec/mesos\" -DUSE_CMAKE_BUILD_CONFIG 
> -DUSE_STATIC_LIB -DVERSION=\"1.4.0\" -D__STDC_FORMAT_MACROS 
> -Dmesos_1_4_0_EXPORTS -I/Users/damien.gerard/projects/acp/mesos/include 
> -I/Users/damien.gerard/projects/acp/mesos/include/mesos 
> -I/Users/damien.gerard/projects/acp/mesos/src -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/protobuf-3.3.0/src/protobuf-3.3.0-lib/lib/include
>  -isystem /Users/damien.gerard/projects/acp/mesos/3rdparty/libprocess/include 
> -isystem /usr/local/opt/apr/libexec/include/apr-1 -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/boost-1.53.0/src/boost-1.53.0
>  -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/elfio-3.2/src/elfio-3.2 
> -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/glog-0.3.3/src/glog-0.3.3-lib/lib/include
>  -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/nvml-352.79/src/nvml-352.79 
> -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/picojson-1.3.0/src/picojson-1.3.0
>  -isystem /usr/local/include/subversion-1 -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/stout/include -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/http_parser-2.6.2/src/http_parser-2.6.2
>  -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/concurrentqueue-1.0.0-beta/src/concurrentqueue-1.0.0-beta
>  -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/libev-4.22/src/libev-4.22 
> -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/zookeeper-3.4.8/src/zookeeper-3.4.8/src/c/include
>  -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/zookeeper-3.4.8/src/zookeeper-3.4.8/src/c/generated
>  -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/leveldb-1.19/src/leveldb-1.19/include
>   -std=c++11 -fPIC   -o 
> CMakeFiles/mesos-1.4.0.dir/slave/containerizer/mesos/provisioner/backends/copy.cpp.o
>  -c 
> /Users/damien.gerard/projects/acp/mesos/src/slave/containerizer/mesos/provisioner/backends/copy.cpp
> /Users/damien.gerard/projects/acp/mesos/src/slave/containerizer/mesos/provisioner/appc/store.cpp:132:46:
>  error: no member named 'fetcher' in namespace 'mesos::uri'; did you mean 
> 'Fetcher'?
>   Try> uriFetcher = uri::fetcher::create();
> ~^~~
>  Fetcher
> /Users/damien.gerard/projects/acp/mesos/include/mesos/uri/fetcher.hpp:46:7: 
> note: 'Fetcher' declared here
> class Fetcher
>   ^
> /Users/damien.gerard/projects/acp/mesos/src/slave/containerizer/mesos/provisioner/appc/store.cpp:132:55:
>  error: no member named 'create' in 'mesos::uri::Fetcher'
>   Try> uriFetcher = uri::fetcher::create();
> {noformat}
> Both Linux & macOS, not tested elsewhere, on {{master}} and tag 1.4.0-rc3
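The guard requested at the top of this issue can be sketched in a few lines of CMake (the message text is illustrative; `CMAKE_SOURCE_DIR` and `CMAKE_BINARY_DIR` are standard CMake variables):

```cmake
if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_BINARY_DIR}")
  message(FATAL_ERROR
    "In-source builds are not supported. "
    "Please create a separate build directory and run cmake from there.")
endif()
```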





[jira] [Assigned] (MESOS-6674) Add Python sources to the CMake build

2019-03-08 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-6674:
--

Assignee: (was: Srinivas)

> Add Python sources to the CMake build
> -
>
> Key: MESOS-6674
> URL: https://issues.apache.org/jira/browse/MESOS-6674
> Project: Mesos
>  Issue Type: Task
>  Components: cmake
>Reporter: Joseph Wu
>Priority: Major
>  Labels: microsoft
>
> The Python portion of the build includes a scheduler and executor driver as 
> well as Mesos protobufs.  Eventually, there will also be a CLI component as 
> well.
> See the automake sources for more details.  i.e.
> https://github.com/apache/mesos/blob/2a73d956af1cb0615d4e66de126ab554fdabb0b5/src/Makefile.am#L1726-L1752





[jira] [Assigned] (MESOS-2382) replace unsafe "find | xargs" with "find -exec"

2019-03-08 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-2382:
--

Assignee: (was: Diana Arroyo)

> replace unsafe "find | xargs" with "find -exec"
> ---
>
> Key: MESOS-2382
> URL: https://issues.apache.org/jira/browse/MESOS-2382
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.20.1
>Reporter: Lukas Loesche
>Priority: Major
>  Labels: easyfix, patch
>
> The problem exists in
>  1194:src/Makefile.am
>  47:src/tests/balloon_framework_test.sh
> The current "find | xargs rm -rf" in the Makefile could potentially destroy 
> data if mesos source was in a folder with a space in the name. E.g. if you 
> for some reason checkout mesos to "/ mesos" the command in src/Makefile.am 
> would turn into a rm -rf /
> "find | xargs" should be NUL delimited with "find -print0 | xargs -0" for 
> safer execution or can just be replaced with the find build-in option "find 
> -exec '{}' \+" which behaves similar to xargs.
> There was a second occurrence of this in a test script, though in that case 
> it would only rmdir empty folders, so is less critical.
> I submitted a PR here: https://github.com/apache/mesos/pull/36
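The hazard and the fix can be illustrated without running find at all. This hedged Python sketch mimics the argument splitting that xargs performs, using the "/ mesos" path from the description:

```python
# Why "find | xargs rm -rf" is unsafe with spaces in paths, and why
# NUL delimiting fixes it. Pure-Python illustration of the splitting
# xargs performs; the path reuses the "/ mesos" example above.
path = "/ mesos/build/foo.o"

# Default xargs splits on any whitespace: the path shatters, and the
# first argument handed to `rm -rf` becomes "/".
naive_args = path.split()
assert naive_args[0] == "/"

# With `find -print0 | xargs -0`, arguments are NUL-delimited, so
# embedded spaces survive intact.
stream = path.encode() + b"\0"
safe_args = [p.decode() for p in stream.split(b"\0") if p]
assert safe_args == ["/ mesos/build/foo.o"]
```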





[jira] [Commented] (MESOS-2379) Disabled master authentication causes authentication retries in the scheduler.

2019-03-08 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16787793#comment-16787793
 ] 

Alexander Rukletsov commented on MESOS-2379:


B. seems to be implemented now: 
https://github.com/apache/mesos/blob/996862828ca9b7675e40b495fe24d95615bb832b/src/sched/sched.cpp#L487-L505

C. is questionable: for the scheduler library to understand how to recover from 
{{AuthenticationErrorMessage}}, we would have to augment 
{{AuthenticationErrorMessage}} with a hint about what kind of error happened (we 
already do this in the error string), think a {{Reason}} enum.

On the other hand, we might not want to mask such errors and should instead make 
sure an operator is engaged: what if the intention was to enable authentication 
(which is why the scheduler attempts it), but the master was misconfigured?
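Independently of the above, the unthrottled retry loop quoted in this issue could be bounded. A hedged Python sketch of a retry limit with exponential backoff (names and limits are illustrative; the real fix would live in sched.cpp):

```python
import random

# Bounded retries with exponential backoff and jitter, sketching the
# "TODO(vinod): Add a limit on number of retries" from sched.cpp.
# Delays are recorded rather than slept, to keep the sketch testable.
def authenticate_with_retries(authenticate, max_retries=5, base_delay=0.1):
    delays = []
    for attempt in range(max_retries):
        if authenticate():
            return True, delays
        delays.append(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    return False, delays

ok, delays = authenticate_with_retries(lambda: False)
assert ok is False and len(delays) == 5
```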

> Disabled master authentication causes authentication retries in the 
> scheduler. 
> ---
>
> Key: MESOS-2379
> URL: https://issues.apache.org/jira/browse/MESOS-2379
> Project: Mesos
>  Issue Type: Bug
>  Components: security
>Reporter: Till Toenshoff
>Priority: Major
>  Labels: authentication, tech-debt
>
> The CRAM-MD5 authenticator relies upon shared credentials. Not supplying such 
> credentials while starting up a master effectively disables any 
> authentication.
> A framework (or slave) may still attempt to authenticate which is answered by 
> an {{AuthenticationErrorMessage}} by the master. That in turn will cause the 
> authenticatee to fail its {{authenticate}} promise, which in turn will cause 
> the current framework driver implementation to infinitely (and unthrottled) 
> retry authentication.
> See: https://github.com/apache/mesos/blob/master/src/sched/sched.cpp#L372
> {noformat}
> if (reauthenticate || !future.isReady()) {
>   LOG(INFO)
> << "Failed to authenticate with master " << master.get() << ": "
> << (reauthenticate ? "master changed" :
>(future.isFailed() ? future.failure() : "future discarded"));
>   authenticating = None();
>   reauthenticate = false;
>   // TODO(vinod): Add a limit on number of retries.
>   dispatch(self(), ::authenticate); // Retry.
>   return;
> }
> {noformat}





[jira] [Commented] (MESOS-3973) Failing 'make distcheck' on Mac OS X 10.10.5, also 10.11.

2019-03-08 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16787789#comment-16787789
 ] 

Alexander Rukletsov commented on MESOS-3973:


[~chhsia0]
The steps were:
{noformat}
git clone https://github.com/apache/mesos mesos-1.8.0
cd mesos-1.8.0
./bootstrap
mkdir build
cd build
../configure
make distcheck
{noformat}
However, saying {{make}} before {{make distcheck}} fixes this for me.

> Failing 'make distcheck' on Mac OS X 10.10.5, also 10.11.
> -
>
> Key: MESOS-3973
> URL: https://issues.apache.org/jira/browse/MESOS-3973
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.21.0, 0.21.2, 0.22.0, 0.23.0, 0.24.0, 0.25.0, 0.26.0
> Environment: Mac OS X 10.10.5, Clang 7.0.0.
>Reporter: Bernd Mathiske
>Priority: Major
>  Labels: build, build-failure, mesosphere
> Attachments: dist_check.log
>
>
> Non-root 'make distcheck.
> {noformat}
> ...
> [--] Global test environment tear-down
> [==] 826 tests from 113 test cases ran. (276624 ms total)
> [  PASSED  ] 826 tests.
>   YOU HAVE 6 DISABLED TESTS
> Making install in .
> make[3]: Nothing to be done for `install-exec-am'.
>  ../install-sh -c -d 
> '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/lib/pkgconfig'
>  /usr/bin/install -c -m 644 mesos.pc 
> '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/lib/pkgconfig'
> Making install in 3rdparty
> /Applications/Xcode.app/Contents/Developer/usr/bin/make  install-recursive
> Making install in libprocess
> Making install in 3rdparty
> /Applications/Xcode.app/Contents/Developer/usr/bin/make  install-recursive
> Making install in stout
> Making install in .
> make[9]: Nothing to be done for `install-exec-am'.
> make[9]: Nothing to be done for `install-data-am'.
> Making install in include
> make[9]: Nothing to be done for `install-exec-am'.
>  ../../../../../../3rdparty/libprocess/3rdparty/stout/install-sh -c -d 
> '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/include'
>  ../../../../../../3rdparty/libprocess/3rdparty/stout/install-sh -c -d 
> '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/include/stout'
>  /usr/bin/install -c -m 644  
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/abort.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/attributes.hpp
>  
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/base64.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/bits.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/bytes.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/cache.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/check.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/duration.hpp
>  
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/dynamiclibrary.hpp
>  ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/error.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/exit.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/flags.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/foreach.hpp
>  
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/format.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/fs.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/gtest.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/gzip.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/hashmap.hpp
>  
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/hashset.hpp
>  
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/interval.hpp
>  ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/ip.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/json.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/lambda.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/linkedhashmap.hpp
>  ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/list.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/mac.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/multihashmap.hpp
>  
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/multimap.hpp
>  ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/net.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/none.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/nothing.hpp
>  
> 

[jira] [Assigned] (MESOS-2235) Better path handling when using system-wide installations of third party dependencies

2019-03-08 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-2235:
--

Assignee: (was: Kapil Arya)

> Better path handling when using system-wide installations of third party 
> dependencies
> -
>
> Key: MESOS-2235
> URL: https://issues.apache.org/jira/browse/MESOS-2235
> Project: Mesos
>  Issue Type: Improvement
>  Components: build
>Reporter: Kapil Arya
>Priority: Minor
>  Labels: mesosphere
>
> Currently, if one wishes to use the system-wide installation of third party 
> dependencies such as protobuf, the following configure command line is used:
> {code}
> ../configure --with-protobuf=/usr
> {code}
> The configure scripts then adds "/usr/include" to include path and /usr/lib 
> to library path.  However, on some 64-bit systems (e.g., OpenSuse), /usr/lib 
> points to the 32-bit libraries and thus the build system ends up printing a 
> bunch of warnings:
> {code}
> libtool: link: g++ -g1 -O0 -Wno-unused-local-typedefs -std=c++11 -o 
> .libs/mesos-slave slave/mesos_slave-main.o  -L/usr/lib ./.libs/libmesos.so 
> -lprotobuf -lsasl2 -lsvn_delta-1 -lsvn_subr-1 -lapr-1 -lcurl -lz -lpthread 
> -lrt -lunwind
> /usr/lib64/gcc/x86_64-suse-linux/4.8/../../../../x86_64-suse-linux/bin/ld: 
> skipping incompatible /usr/lib/libpthread.so when searching for -lpthread
> /usr/lib64/gcc/x86_64-suse-linux/4.8/../../../../x86_64-suse-linux/bin/ld: 
> skipping incompatible /usr/lib/librt.so when searching for -lrt
> /usr/lib64/gcc/x86_64-suse-linux/4.8/../../../../x86_64-suse-linux/bin/ld: 
> skipping incompatible /usr/lib/libm.so when searching for -lm
> /usr/lib64/gcc/x86_64-suse-linux/4.8/../../../../x86_64-suse-linux/bin/ld: 
> skipping incompatible /usr/lib/libc.so when searching for -lc
> {code}
> Further, if someone uses system-wide installations, we can omit the path with 
> the configure flag and the system should be able to pick the correct flags. 
> E.g, the above example becomes:
> {code}
> ../configure --with-protobuf
> {code}
> Since, the correct system include and lib dirs are already in the standard 
> path, we don't need to specify that path.





[jira] [Created] (MESOS-9638) Mesos masters do not authenticate with agents.

2019-03-07 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-9638:
--

 Summary: Mesos masters do not authenticate with agents.
 Key: MESOS-9638
 URL: https://issues.apache.org/jira/browse/MESOS-9638
 Project: Mesos
  Issue Type: Improvement
  Components: agent, master
Reporter: Alexander Rukletsov


Currently Mesos agents do not verify that the messages they receive are coming 
from the leading master and haven't been tampered with. In untrusted 
environments this can be a source of security issues.

There are a couple of ways to fix this:
1) implement Master authentication on the transport or application level for 
each {{agent}}<->{{master}} connection (this might not be sufficient to 
distinguish a master from the leading master)
2) implement Master authentication on the transport level (for the connection 
to be encrypted) upon agent registration and pass a secret to the master for 
all subsequent, possibly separate and unencrypted, connections (the secret can 
be leaked on an unencrypted connection).
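The secret handed over in option (2) could be used for per-message authentication. A hedged Python sketch with the stdlib {{hmac}} module (the message name and secret are illustrative, not the Mesos wire protocol):

```python
import hashlib
import hmac

# After registration, agent and master share a secret; each message
# carries an HMAC tag so the agent can verify it came from the leading
# master and was not tampered with. Purely illustrative.
SECRET = b"shared-registration-secret"

def sign(message: bytes) -> bytes:
    return hmac.new(SECRET, message, hashlib.sha256).digest()

def verify(message: bytes, tag: bytes) -> bool:
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(sign(message), tag)

msg = b"UpdateSlaveMessage"  # hypothetical message name
tag = sign(msg)
assert verify(msg, tag)
assert not verify(b"tampered", tag)
```

Note the caveat in option (2) still applies: the scheme is only as strong as the secrecy of the shared secret on subsequent connections.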





[jira] [Commented] (MESOS-3973) Failing 'make distcheck' on Mac OS X 10.10.5, also 10.11.

2019-03-07 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786742#comment-16786742
 ] 

Alexander Rukletsov commented on MESOS-3973:


As of today, {{make distcheck}} for {{1.8.0-dev}} on Mac OS 10.13.6 still 
fails, while {{make check}} works. However, looking at the log, the problem now 
seems to be related to gRPC support. 
{noformat}
touch libev-4.22-build-stamp
../protobuf-3.5.0/src/protoc -I../../../3rdparty/libprocess/src/tests 
--cpp_out=. ../../../3rdparty/libprocess/src/tests/benchmarks.proto
../protobuf-3.5.0/src/protoc -I../../../3rdparty/libprocess/src/tests 
--grpc_out=. ../../../3rdparty/libprocess/src/tests/grpc_tests.proto
  \
  
--plugin=protoc-gen-grpc=../grpc-1.10.0/bins/opt/grpc_cpp_plugin
../protobuf-3.5.0/src/protoc -I../../../3rdparty/libprocess/src/tests 
--cpp_out=. ../../../3rdparty/libprocess/src/tests/grpc_tests.proto
/Library/Developer/CommandLineTools/usr/bin/make  distdir-am
 (cd include && /Library/Developer/CommandLineTools/usr/bin/make  
top_distdir=../../../mesos-1.8.0 
distdir=../../../mesos-1.8.0/3rdparty/libprocess/include \
 am__remove_distdir=: am__skip_length_check=: am__skip_mode_fix=: distdir)
/Library/Developer/CommandLineTools/usr/bin/make  distdir-am
/Library/Developer/CommandLineTools/usr/bin/make  \
  top_distdir="../../mesos-1.8.0" 
distdir="../../mesos-1.8.0/3rdparty/libprocess" \
  dist-hook
cp -r ../../../3rdparty/libprocess/3rdparty 
../../mesos-1.8.0/3rdparty/libprocess/
 (cd src && /Library/Developer/CommandLineTools/usr/bin/make  
top_distdir=../mesos-1.8.0 distdir=../mesos-1.8.0/src \
 am__remove_distdir=: am__skip_length_check=: am__skip_mode_fix=: distdir)
make[3]: *** No rule to make target `../include/csi/csi.grpc.pb.cc', needed by 
`distdir'.  Stop.
make[2]: *** [distdir-am] Error 1
make[1]: *** [distdir] Error 2
make: *** [dist] Error 2
{noformat}
[~chhsia0] any chance you know, off the top of your head, why this happens?

> Failing 'make distcheck' on Mac OS X 10.10.5, also 10.11.
> -
>
> Key: MESOS-3973
> URL: https://issues.apache.org/jira/browse/MESOS-3973
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.21.0, 0.21.2, 0.22.0, 0.23.0, 0.24.0, 0.25.0, 0.26.0
> Environment: Mac OS X 10.10.5, Clang 7.0.0.
>Reporter: Bernd Mathiske
>Priority: Major
>  Labels: build, build-failure, mesosphere
> Attachments: dist_check.log
>
>
> Non-root 'make distcheck'.
> {noformat}
> ...
> [--] Global test environment tear-down
> [==] 826 tests from 113 test cases ran. (276624 ms total)
> [  PASSED  ] 826 tests.
>   YOU HAVE 6 DISABLED TESTS
> Making install in .
> make[3]: Nothing to be done for `install-exec-am'.
>  ../install-sh -c -d 
> '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/lib/pkgconfig'
>  /usr/bin/install -c -m 644 mesos.pc 
> '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/lib/pkgconfig'
> Making install in 3rdparty
> /Applications/Xcode.app/Contents/Developer/usr/bin/make  install-recursive
> Making install in libprocess
> Making install in 3rdparty
> /Applications/Xcode.app/Contents/Developer/usr/bin/make  install-recursive
> Making install in stout
> Making install in .
> make[9]: Nothing to be done for `install-exec-am'.
> make[9]: Nothing to be done for `install-data-am'.
> Making install in include
> make[9]: Nothing to be done for `install-exec-am'.
>  ../../../../../../3rdparty/libprocess/3rdparty/stout/install-sh -c -d 
> '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/include'
>  ../../../../../../3rdparty/libprocess/3rdparty/stout/install-sh -c -d 
> '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/include/stout'
>  /usr/bin/install -c -m 644  
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/abort.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/attributes.hpp
>  
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/base64.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/bits.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/bytes.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/cache.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/check.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/duration.hpp
>  
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/dynamiclibrary.hpp
>  ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/error.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/exit.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/flags.hpp 
> 

[jira] [Assigned] (MESOS-1776) --without-PACKAGE will set incorrect dependency prefix

2019-03-07 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-1776:
--

Assignee: (was: Kamil Domański)

> --without-PACKAGE will set incorrect dependency prefix
> --
>
> Key: MESOS-1776
> URL: https://issues.apache.org/jira/browse/MESOS-1776
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.20.0
>Reporter: Kamil Domański
>Priority: Major
>  Labels: build
>
> When disabling a particular bundled dependency with *--without-PACKAGE*, the 
> build scripts of both Mesos and libprocess set the corresponding variable 
> to "no". This value is later treated as the prefix under which to search for 
> the package.
> For example, with *--without-protobuf*, the script will search for *protoc* 
> under *no/bin* and obviously fail. I propose getting rid of these prefixes 
> entirely and instead searching in the default locations.





[jira] [Assigned] (MESOS-1602) Add checks for unbundled libev

2019-03-07 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-1602:
--

Assignee: (was: Timothy St. Clair)

> Add checks for unbundled libev
> --
>
> Key: MESOS-1602
> URL: https://issues.apache.org/jira/browse/MESOS-1602
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.20.0
>Reporter: Timothy St. Clair
>Priority: Major
>
> Per the review, break out a check to ensure libev has been compiled with 
> -DEV_CHILD_ENABLE=0.
> Also update the checks for a prefixed installation.





[jira] [Created] (MESOS-9636) Autotools improvements

2019-03-07 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-9636:
--

 Summary: Autotools improvements
 Key: MESOS-9636
 URL: https://issues.apache.org/jira/browse/MESOS-9636
 Project: Mesos
  Issue Type: Epic
  Components: build
Reporter: Alexander Rukletsov








[jira] [Assigned] (MESOS-3973) Failing 'make distcheck' on Mac OS X 10.10.5, also 10.11.

2019-03-07 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-3973:
--

Assignee: (was: Gilbert Song)

> Failing 'make distcheck' on Mac OS X 10.10.5, also 10.11.
> -
>
> Key: MESOS-3973
> URL: https://issues.apache.org/jira/browse/MESOS-3973
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.21.0, 0.21.2, 0.22.0, 0.23.0, 0.24.0, 0.25.0, 0.26.0
> Environment: Mac OS X 10.10.5, Clang 7.0.0.
>Reporter: Bernd Mathiske
>Priority: Major
>  Labels: build, build-failure, mesosphere
> Attachments: dist_check.log
>
>
> Non-root 'make distcheck'.
> {noformat}
> ...
> [--] Global test environment tear-down
> [==] 826 tests from 113 test cases ran. (276624 ms total)
> [  PASSED  ] 826 tests.
>   YOU HAVE 6 DISABLED TESTS
> Making install in .
> make[3]: Nothing to be done for `install-exec-am'.
>  ../install-sh -c -d 
> '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/lib/pkgconfig'
>  /usr/bin/install -c -m 644 mesos.pc 
> '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/lib/pkgconfig'
> Making install in 3rdparty
> /Applications/Xcode.app/Contents/Developer/usr/bin/make  install-recursive
> Making install in libprocess
> Making install in 3rdparty
> /Applications/Xcode.app/Contents/Developer/usr/bin/make  install-recursive
> Making install in stout
> Making install in .
> make[9]: Nothing to be done for `install-exec-am'.
> make[9]: Nothing to be done for `install-data-am'.
> Making install in include
> make[9]: Nothing to be done for `install-exec-am'.
>  ../../../../../../3rdparty/libprocess/3rdparty/stout/install-sh -c -d 
> '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/include'
>  ../../../../../../3rdparty/libprocess/3rdparty/stout/install-sh -c -d 
> '/Users/bernd/mesos/mesos/build/mesos-0.26.0/_inst/include/stout'
>  /usr/bin/install -c -m 644  
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/abort.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/attributes.hpp
>  
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/base64.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/bits.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/bytes.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/cache.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/check.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/duration.hpp
>  
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/dynamiclibrary.hpp
>  ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/error.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/exit.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/flags.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/foreach.hpp
>  
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/format.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/fs.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/gtest.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/gzip.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/hashmap.hpp
>  
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/hashset.hpp
>  
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/interval.hpp
>  ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/ip.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/json.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/lambda.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/linkedhashmap.hpp
>  ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/list.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/mac.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/multihashmap.hpp
>  
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/multimap.hpp
>  ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/net.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/none.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/nothing.hpp
>  
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/numify.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/option.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/os.hpp 
> ../../../../../../3rdparty/libprocess/3rdparty/stout/include/stout/path.hpp 

[jira] [Assigned] (MESOS-2092) Make ACLs dynamic

2019-03-07 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-2092:
--

Assignee: (was: Yongqiao Wang)

> Make ACLs dynamic
> -
>
> Key: MESOS-2092
> URL: https://issues.apache.org/jira/browse/MESOS-2092
> Project: Mesos
>  Issue Type: Task
>  Components: security
>Reporter: Alexander Rukletsov
>Priority: Major
>  Labels: mesosphere, newbie
>
> The master loads ACLs once during launch, and there is no way to update them 
> in a running master. Making them dynamic would allow updating ACLs on the 
> fly, for example to grant a new framework the necessary rights.





[jira] [Assigned] (MESOS-4036) Install instructions for CentOS 6.6 lead to errors running `perf`.

2019-03-07 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-4036:
--

Assignee: Alexander Rukletsov

> Install instructions for CentOS 6.6 lead to errors running `perf`.
> --
>
> Key: MESOS-4036
> URL: https://issues.apache.org/jira/browse/MESOS-4036
> Project: Mesos
>  Issue Type: Improvement
> Environment: CentOS 6.6
>Reporter: Greg Mann
>Assignee: Alexander Rukletsov
>Priority: Minor
>  Labels: integration, mesosphere, newbie
>
> After using the current installation instructions in the getting started 
> documentation, {{perf}} will not run on CentOS 6.6 because the version of 
> elfutils included in devtoolset-2 is not compatible with the version of 
> {{perf}} installed by {{yum}}. Installing and using devtoolset-3, however 
> (http://linux.web.cern.ch/linux/scientific6/docs/softwarecollections.shtml) 
> fixes this issue. This could be resolved by updating the getting started 
> documentation to recommend installing devtoolset-3.





[jira] [Assigned] (MESOS-5588) Improve error handling when parsing acls.

2019-03-07 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-5588:
--

Assignee: (was: Jörg Schad)

> Improve error handling when parsing acls.
> -
>
> Key: MESOS-5588
> URL: https://issues.apache.org/jira/browse/MESOS-5588
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Jörg Schad
>Priority: Major
>  Labels: mesosphere, security
>
> During parsing of the authorizer configuration, errors are ignored. This can 
> lead to undetected security issues.
> Consider the following ACL with a typo (usr instead of user):
> {code}
>"view_frameworks": [
>   {
> "principals": { "type": "ANY" },
> "usr": { "type": "NONE" }
>   }
> ]
> {code}
> When the master is started with these flags, it will interpret the ACL in 
> the following way, which gives any principal access to any framework:
> {noformat}
> view_frameworks {
>   principals {
> type: ANY
>   }
> }
> {noformat}
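One way to address this is to reject unknown fields instead of silently 
dropping them. The sketch below assumes a JSON-based parser with a fixed key 
set; the key names are illustrative, not the actual Mesos ACL schema (which 
is defined by the ACLs protobuf):

```python
import json

# Hypothetical set of keys allowed in one ACL entry.
KNOWN_KEYS = {"principals", "users"}

def parse_acl_entry(raw: str) -> dict:
    """Parse one ACL entry, failing loudly on unrecognized fields."""
    entry = json.loads(raw)
    unknown = set(entry) - KNOWN_KEYS
    if unknown:
        # Reject instead of ignoring, so a typo like "usr" cannot
        # silently widen access to everyone.
        raise ValueError(f"unknown ACL field(s): {sorted(unknown)}")
    return entry

# A well-formed entry parses...
parse_acl_entry('{"principals": {"type": "ANY"}, "users": {"type": "NONE"}}')

# ...while the misspelled "usr" is rejected instead of being dropped.
try:
    parse_acl_entry('{"principals": {"type": "ANY"}, "usr": {"type": "NONE"}}')
    raise AssertionError("typo was silently accepted")
except ValueError:
    pass
```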





[jira] [Commented] (MESOS-5027) Enable authenticated login in the webui

2019-03-07 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-5027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16786559#comment-16786559
 ] 

Alexander Rukletsov commented on MESOS-5027:


Apparently, nothing grew out of haosdent's attempt. Closing this as "won't 
do".

> Enable authenticated login in the webui
> ---
>
> Key: MESOS-5027
> URL: https://issues.apache.org/jira/browse/MESOS-5027
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, security, webui
>Reporter: Greg Mann
>Assignee: haosdent
>Priority: Major
>  Labels: mesosphere, security
> Attachments: Screen Shot 2016-04-07 at 21.02.45.png
>
>
> The webui hits a number of endpoints to get the data that it displays: 
> {{/state}}, {{/metrics/snapshot}}, {{/files/browse}}, {{/files/read}}, and 
> maybe others? Once authentication is enabled on these endpoints, we need to 
> add a login prompt to the webui so that users can provide credentials.





[jira] [Commented] (MESOS-9579) ExecutorHttpApiTest.HeartbeatCalls is flaky.

2019-03-06 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785917#comment-16785917
 ] 

Alexander Rukletsov commented on MESOS-9579:


Another instance observed today on Ubuntu 14.04:
{noformat}
20:42:56 [ RUN  ] ExecutorHttpApiTest.HeartbeatCalls
20:42:56 I0305 20:42:56.060261 28896 executor.cpp:206] Version: 1.8.0
20:42:56 W0305 20:42:56.060288 28896 process.cpp:2829] Attempted to spawn 
already running process version@172.16.10.87:33003
20:42:56 I0305 20:42:56.060858 28899 executor.cpp:432] Connected with the agent
20:42:56 F0305 20:42:56.060952 28899 owned.hpp:112] Check failed: 'get()' Must 
be non NULL 
20:42:56 *** Check failure stack trace: ***
20:42:56 @ 0x7fb09b359ead  google::LogMessage::Fail()
20:42:56 @ 0x7fb09b35bcdd  google::LogMessage::SendToLog()
20:42:56 @ 0x7fb09b359a9c  google::LogMessage::Flush()
20:42:56 @ 0x7fb09b35c5d9  google::LogMessageFatal::~LogMessageFatal()
20:42:56 @ 0x7fb09d1d79fd  google::CheckNotNull<>()
20:42:56 @ 0x7fb09d1be8c4  
_ZNSt17_Function_handlerIFvvEZN5mesos8internal5tests39ExecutorHttpApiTest_HeartbeatCalls_Test8TestBodyEvEUlvE_E9_M_invokeERKSt9_Any_data
20:42:56 @ 0x7fb09a1441a0  process::AsyncExecutorProcess::execute<>()
20:42:56 @ 0x7fb09a153908  
_ZN5cpp176invokeIZN7process8dispatchI7NothingNS1_20AsyncExecutorProcessERKSt8functionIFvvEES9_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSE_FSB_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseIS3_EESt14default_deleteISP_EEOS7_PNS1_11ProcessBaseEE_JSS_S7_SV_EEEDTclcl7forwardISB_Efp_Espcl7forwardIT0_Efp0_EEEOSB_DpOSX_
20:42:56 @ 0x7fb09b2ac961  process::ProcessBase::consume()
20:42:56 @ 0x7fb09b2bfbcc  process::ProcessManager::resume()
20:42:56 @ 0x7fb09b2c5596  
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
20:42:56 @ 0x7fb09753da60  (unknown)
20:42:56 @ 0x7fb096d5a184  start_thread
20:42:56 @ 0x7fb096a8703d  (unknown)
20:42:56 timeout: the monitored command dumped core
20:42:56 The test binary has crashed OR the timeout has been exceeded!
{noformat}

> ExecutorHttpApiTest.HeartbeatCalls is flaky.
> 
>
> Key: MESOS-9579
> URL: https://issues.apache.org/jira/browse/MESOS-9579
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.8.0
> Environment: Centos 6
>Reporter: Till Toenshoff
>Priority: Major
>  Labels: flaky, flaky-test
>
> I just saw this failing on our internal CI:
> {noformat}
> 21:42:35 [ RUN ] ExecutorHttpApiTest.HeartbeatCalls
> 21:42:35 I0215 21:42:35.917752 17173 executor.cpp:206] Version: 1.8.0
> 21:42:35 W0215 21:42:35.917771 17173 process.cpp:2829] Attempted to spawn 
> already running process version@172.16.10.166:35439
> 21:42:35 I0215 21:42:35.918581 17174 executor.cpp:432] Connected with the 
> agent
> 21:42:35 F0215 21:42:35.918857 17174 owned.hpp:112] Check failed: 'get()' 
> Must be non NULL 
> 21:42:35 *** Check failure stack trace: ***
> 21:42:35 @ 0x7fb93ce1d1dd google::LogMessage::Fail()
> 21:42:35 @ 0x7fb93ce1ee7d google::LogMessage::SendToLog()
> 21:42:35 @ 0x7fb93ce1cdb3 google::LogMessage::Flush()
> 21:42:35 @ 0x7fb93ce1f879 google::LogMessageFatal::~LogMessageFatal()
> 21:42:35 @ 0x55e80a099f76 google::CheckNotNull<>()
> 21:42:35 @ 0x55e80a07dde4 
> _ZNSt17_Function_handlerIFvvEZN5mesos8internal5tests39ExecutorHttpApiTest_HeartbeatCalls_Test8TestBodyEvEUlvE_E9_M_invokeERKSt9_Any_data
> 21:42:35 @ 0x7fb93baea260 process::AsyncExecutorProcess::execute<>()
> 21:42:35 @ 0x7fb93baf62cb 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchI7NothingNS1_20AsyncExecutorProcessERKSt8functionIFvvEESG_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSL_FSI_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseISA_EESt14default_deleteISW_EEOSE_S3_E_JSZ_SE_St12_PlaceholderILi1EEclEOS3_
> 21:42:36 @ 0x7fb93cd646b1 process::ProcessBase::consume()
> 21:42:36 @ 0x7fb93cd794ba process::ProcessManager::resume()
> 21:42:36 @ 0x7fb93cd7d486 
> _ZNSt6thread11_State_implISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> 21:42:36 @ 0x7fb93d02a1af execute_native_thread_routine
> 21:42:36 @ 0x7fb939794aa1 start_thread
> 21:42:36 @ 0x7fb938b39c4d clone
> 21:42:36 The test binary has crashed OR the timeout has been exceeded!
> 21:42:36 ~/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mesos-ec2-centos-6
> 21:42:36 mkswap: /tmp/swapfile: warning: don't erase bootbits sectors
> 21:42:36 on whole disk. Use -f to force.
> 21:42:36 Setting up swapspace version 1, size = 8388604 KiB
> 21:42:36 no label, UUID=dda5aa26-dba6-4ac8-bc6c-41264f510694
> 21:42:36 gcc (GCC) 6.3.1 20170216 (Red Hat 6.3.1-3)
> 21:42:36 Copyright (C) 

[jira] [Commented] (MESOS-9322) Executor exited accidentally, but mesos-agent did not report TASK_FAILED event.

2019-02-14 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16768514#comment-16768514
 ] 

Alexander Rukletsov commented on MESOS-9322:


[~guoshiwei] I agree and think it is a bug, too. We have recently seen at 
least two bugs related to "zombie executors":
* MESOS-9502: stuck IOSwitchboard
* MESOS-8125, MESOS-9501: PID reuse
Could you provide us with executor and agent logs, so we can determine whether 
you are seeing one of the aforementioned issues or a separate bug?

> Executor exited accidentally, but mesos-agent did not report TASK_FAILED 
> event.
> ---
>
> Key: MESOS-9322
> URL: https://issues.apache.org/jira/browse/MESOS-9322
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.4.1
> Environment: Linux n14-068-081 4.4.0-33.bm.1-amd64 #1 SMP Fri, 01 Sep 
> 2017 18:36:21 +0800 x86_64 GNU/Linux
> OS: debion 8.10
> mesos version: 1.4.1
>Reporter: Shiwei Guo
>Priority: Major
>
> The log about this executor:
> executorid: 
> 'gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz'
>  
> {noformat}
> I0914 10:40:36.448287 2505 slave.cpp:7336] Recovering executor 
> 'gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz'
>  of framework ae7c9e78-e0b7-4110-8092-52baf64e4f67-
> I0914 10:40:36.479209 2511 gc.cpp:58] Scheduling 
> '/opt/tiger/mesos_deploy/mesos_titan/slave/slaves/03def54c-f3f0-4ea5-a886-93fae5e570fa-S3473/frameworks/ae7c9e78-e0b7-4110-8092-52baf64e4f67-/executors/gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz/runs/189e4b23-c892-4c87-9069-dfc98ca5edc8'
>  for gc 3.1546935280563days in the future
> I0914 10:40:36.479287 2511 gc.cpp:58] Scheduling 
> '/opt/tiger/mesos_deploy/mesos_titan/slave/meta/slaves/03def54c-f3f0-4ea5-a886-93fae5e570fa-S3473/frameworks/ae7c9e78-e0b7-4110-8092-52baf64e4f67-/executors/gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz/runs/189e4b23-c892-4c87-9069-dfc98ca5edc8'
>  for gc 3.15469352761481days in the future
> I0914 10:40:36.479310 2511 gc.cpp:58] Scheduling 
> '/opt/tiger/mesos_deploy/mesos_titan/slave/slaves/03def54c-f3f0-4ea5-a886-93fae5e570fa-S3473/frameworks/ae7c9e78-e0b7-4110-8092-52baf64e4f67-/executors/gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz/runs/4b27d1d4-fe67-4475-88bc-14e994acfb85'
>  for gc -1.02171850967407days in the future
> I0914 10:40:36.479337 2511 gc.cpp:58] Scheduling 
> '/opt/tiger/mesos_deploy/mesos_titan/slave/meta/slaves/03def54c-f3f0-4ea5-a886-93fae5e570fa-S3473/frameworks/ae7c9e78-e0b7-4110-8092-52baf64e4f67-/executors/gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz/runs/4b27d1d4-fe67-4475-88bc-14e994acfb85'
>  for gc -1.02171850987259days in the future
> I0914 10:40:36.480459 2514 gc.cpp:169] Deleting 
> /opt/tiger/mesos_deploy/mesos_titan/slave/slaves/03def54c-f3f0-4ea5-a886-93fae5e570fa-S3473/frameworks/ae7c9e78-e0b7-4110-8092-52baf64e4f67-/executors/gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz/runs/4b27d1d4-fe67-4475-88bc-14e994acfb85
> I0914 10:40:36.552492 2516 status_update_manager.cpp:211] Recovering executor 
> 'gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz'
>  of framework ae7c9e78-e0b7-4110-8092-52baf64e4f67-
> I0914 10:40:36.553234 2519 containerizer.cpp:665] Recovering container 
> 106c7257-fabb-4d58-8fcb-89b15bb9d404 for executor 
> 'gn:aweme.recommend.cypher_recent.default;ps:aweme.recommend.cypher_recent_default;sg:263;tp:Companion;nm:aweme_cypher_recent;executor:systemd-mesos-executor-0.2.10.tar.gz'
>  of framework ae7c9e78-e0b7-4110-8092-52baf64e4f67-
> I0914 10:40:36.591421 2514 gc.cpp:177] Deleted 
> 

[jira] [Created] (MESOS-9562) Authorization for DESTROY and UNRESERVE is not symmetrical.

2019-02-08 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-9562:
--

 Summary: Authorization for DESTROY and UNRESERVE is not 
symmetrical.
 Key: MESOS-9562
 URL: https://issues.apache.org/jira/browse/MESOS-9562
 Project: Mesos
  Issue Type: Improvement
  Components: master, scheduler api
Affects Versions: 1.7.1
Reporter: Alexander Rukletsov


For [the {{UNRESERVE}} 
case|https://github.com/apache/mesos/blob/5d3ed364c6d1307d88e6b950ae0eef423c426673/src/master/master.cpp#L3661-L3677],
 if the principal is not set, {{.has_principal()}} returns {{false}}, so we 
never call {{authorizations.push_back()}} and hence never create an authz 
request with this resource as an object. For [the {{DESTROY}} 
case|https://github.com/apache/mesos/blob/5d3ed364c6d1307d88e6b950ae0eef423c426673/src/master/master.cpp#L3772-L3773],
 if the principal is not set, the string's default value {{""}} is used, and 
hence we do create an authz request with this resource as an object.

We definitely need to make the behaviour consistent; I am not sure which 
approach is correct.
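The asymmetry can be modeled with a small sketch. The helper names are 
hypothetical; only the principal-handling logic mirrors the two code paths 
linked above:

```python
from typing import List, Optional

def authz_objects_unreserve(principal: Optional[str]) -> List[dict]:
    """Model of the UNRESERVE path: no principal, no authz object."""
    requests = []
    if principal is not None:  # mirrors the `.has_principal()` guard
        requests.append({"principal": principal})
    return requests

def authz_objects_destroy(principal: Optional[str]) -> List[dict]:
    """Model of the DESTROY path: falls back to the string default ''."""
    return [{"principal": principal if principal is not None else ""}]

# With a principal set, both paths create an authz request...
assert authz_objects_unreserve("alice") == [{"principal": "alice"}]
assert authz_objects_destroy("alice") == [{"principal": "alice"}]

# ...but with no principal, the behaviour diverges: UNRESERVE skips
# the request entirely while DESTROY authorizes against "".
assert authz_objects_unreserve(None) == []
assert authz_objects_destroy(None) == [{"principal": ""}]
```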





[jira] [Assigned] (MESOS-9143) MasterQuotaTest.RemoveSingleQuota is flaky.

2019-02-06 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-9143:
--

Assignee: (was: Greg Mann)

> MasterQuotaTest.RemoveSingleQuota is flaky.
> ---
>
> Key: MESOS-9143
> URL: https://issues.apache.org/jira/browse/MESOS-9143
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Alexander Rukletsov
>Priority: Major
>  Labels: flaky, flaky-test, mesosphere
> Attachments: RemoveSingleQuota-badrun.txt
>
>
> {noformat}
> ../../src/tests/master_quota_tests.cpp:493
> Value of: metrics.at(metricKey).isNone()
>   Actual: false
> Expected: true
> {noformat}





[jira] [Commented] (MESOS-9533) CniIsolatorTest.ROOT_CleanupAfterReboot is flaky.

2019-01-30 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755853#comment-16755853
 ] 

Alexander Rukletsov commented on MESOS-9533:


I've reopened this because I have observed the same failure on the {{1.7.x}} 
branch. I've also set the fix versions to match those in MESOS-9518, since I 
suppose those are the branches into which the test has been reintroduced.

> CniIsolatorTest.ROOT_CleanupAfterReboot is flaky.
> -
>
> Key: MESOS-9533
> URL: https://issues.apache.org/jira/browse/MESOS-9533
> Project: Mesos
>  Issue Type: Bug
>  Components: cni, containerization
>Affects Versions: 1.8.0
> Environment: centos-6 with SSL enabled
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Major
>  Labels: cni, flaky-test
> Fix For: 1.4.3, 1.5.3, 1.6.2, 1.7.2, 1.8.0
>
>
> {noformat}
> Error Message
> ../../src/tests/containerizer/cni_isolator_tests.cpp:2685
> Mock function called more times than expected - returning directly.
> Function call: statusUpdate(0x7fffc7c05aa0, @0x7fe637918430 136-byte 
> object <80-24 29-45 E6-7F 00-00 00-00 00-00 00-00 00-00 3E-E8 00-00 00-00 
> 00-00 00-B8 0E-20 F0-55 00-00 C0-03 07-18 E6-7F 00-00 20-17 05-18 E6-7F 00-00 
> 10-50 05-18 E6-7F 00-00 50-D1 04-18 E6-7F 00-00 ... 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 F0-89 16-E9 58-2B D7-41 00-00 00-00 01-00 00-00 18-00 00-00 
> 0B-00 00-00>)
>  Expected: to be called 3 times
>Actual: called 4 times - over-saturated and active
> Stacktrace
> ../../src/tests/containerizer/cni_isolator_tests.cpp:2685
> Mock function called more times than expected - returning directly.
> Function call: statusUpdate(0x7fffc7c05aa0, @0x7fe637918430 136-byte 
> object <80-24 29-45 E6-7F 00-00 00-00 00-00 00-00 00-00 3E-E8 00-00 00-00 
> 00-00 00-B8 0E-20 F0-55 00-00 C0-03 07-18 E6-7F 00-00 20-17 05-18 E6-7F 00-00 
> 10-50 05-18 E6-7F 00-00 50-D1 04-18 E6-7F 00-00 ... 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 F0-89 16-E9 58-2B D7-41 00-00 00-00 01-00 00-00 18-00 00-00 
> 0B-00 00-00>)
>  Expected: to be called 3 times
>Actual: called 4 times - over-saturated and active
> {noformat}
> It was from this commit 
> https://github.com/apache/mesos/commit/c338f5ada0123c0558658c6452ac3402d9fbec29





[jira] [Assigned] (MESOS-3123) DockerContainerizerTest.ROOT_DOCKER_Launch_Executor_Bridged fails & crashes

2019-01-08 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-3123:
--

Assignee: (was: Timothy Chen)

> DockerContainerizerTest.ROOT_DOCKER_Launch_Executor_Bridged fails & crashes
> ---
>
> Key: MESOS-3123
> URL: https://issues.apache.org/jira/browse/MESOS-3123
> Project: Mesos
>  Issue Type: Bug
>  Components: docker, test
>Affects Versions: 0.23.0
> Environment: CentOS 7.1, CentOS 6.6, or Ubuntu 14.04
> Mesos 0.23.0-rc4 or today's master
> Docker 1.9
>Reporter: Adam B
>Priority: Major
>  Labels: disabled-test, mesosphere
> Fix For: 0.26.0
>
>
> Fails the test and then crashes while trying to shutdown the slaves.
> {code}
> [ RUN  ] DockerContainerizerTest.ROOT_DOCKER_Launch_Executor_Bridged
> ../../src/tests/docker_containerizer_tests.cpp:618: Failure
> Value of: statusRunning.get().state()
>   Actual: TASK_LOST
> Expected: TASK_RUNNING
> ../../src/tests/docker_containerizer_tests.cpp:619: Failure
> Failed to wait 1mins for statusFinished
> ../../src/tests/docker_containerizer_tests.cpp:610: Failure
> Actual function call count doesn't match EXPECT_CALL(sched, 
> statusUpdate(, _))...
>  Expected: to be called twice
>Actual: called once - unsatisfied and active
> F0721 21:59:54.950773 30622 logging.cpp:57] RAW: Pure virtual method called
> @ 0x7f3915347a02  google::LogMessage::Fail()
> @ 0x7f391534cee4  google::RawLog__()
> @ 0x7f3914890312  __cxa_pure_virtual
> @   0x88c3ae  mesos::internal::tests::Cluster::Slaves::shutdown()
> @   0x88c176  mesos::internal::tests::Cluster::Slaves::~Slaves()
> @   0x88dc16  mesos::internal::tests::Cluster::~Cluster()
> @   0x88dc87  mesos::internal::tests::MesosTest::~MesosTest()
> @   0xa529ab  
> mesos::internal::tests::DockerContainerizerTest::~DockerContainerizerTest()
> @   0xa8125f  
> mesos::internal::tests::DockerContainerizerTest_ROOT_DOCKER_Launch_Executor_Bridged_Test::~DockerContainerizerTest_ROOT_DOCKER_Launch_Executor_Bridged_Test()
> @   0xa8128e  
> mesos::internal::tests::DockerContainerizerTest_ROOT_DOCKER_Launch_Executor_Bridged_Test::~DockerContainerizerTest_ROOT_DOCKER_Launch_Executor_Bridged_Test()
> @  0x1218b4e  testing::Test::DeleteSelf_()
> @  0x1221909  
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x121cb38  
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1205713  testing::TestInfo::Run()
> @  0x1205c4e  testing::TestCase::Run()
> @  0x120a9ca  testing::internal::UnitTestImpl::RunAllTests()
> @  0x122277b  
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x121d81b  
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x120987a  testing::UnitTest::Run()
> @   0xcfbf0c  main
> @ 0x7f391097caf5  __libc_start_main
> @   0x882089  (unknown)
> make[3]: *** [check-local] Aborted (core dumped)
> make[3]: Leaving directory `/home/me/mesos/build/src'
> make[2]: *** [check-am] Error 2
> make[2]: Leaving directory `/home/me/mesos/build/src'
> make[1]: *** [check] Error 2
> make[1]: Leaving directory `/home/me/mesos/build/src'
> make: *** [check-recursive] Error 1
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-6780) ContentType/AgentAPIStreamingTest.AttachContainerInput test fails reliably

2019-01-08 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-6780:
--

Assignee: (was: Kevin Klues)

> ContentType/AgentAPIStreamingTest.AttachContainerInput test fails reliably
> --
>
> Key: MESOS-6780
> URL: https://issues.apache.org/jira/browse/MESOS-6780
> Project: Mesos
>  Issue Type: Bug
> Environment: Mac OS 10.12, clang version 4.0.0 
> (http://llvm.org/git/clang 88800602c0baafb8739cb838c2fa3f5fb6cc6968) 
> (http://llvm.org/git/llvm 25801f0f22e178343ee1eadfb4c6cc058628280e), 
> libc++-513447dbb91dd555ea08297dbee6a1ceb6abdc46
>Reporter: Benjamin Bannier
>Priority: Major
>  Labels: disabled-test, flaky-test, mesosphere
> Attachments: attach_container_input_no_ssl.log
>
>
> The test {{ContentType/AgentAPIStreamingTest.AttachContainerInput}} (both 
> {{/0}} and {{/1}}) fail consistently for me in an SSL-enabled, optimized 
> build.
> {code}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from ContentType/AgentAPIStreamingTest
> [ RUN  ] ContentType/AgentAPIStreamingTest.AttachContainerInput/0
> I1212 17:11:12.371175 3971208128 cluster.cpp:160] Creating default 'local' 
> authorizer
> I1212 17:11:12.393844 17362944 master.cpp:380] Master 
> c752777c-d947-4a86-b382-643463866472 (172.18.8.114) started on 
> 172.18.8.114:51059
> I1212 17:11:12.393899 17362944 master.cpp:382] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" 
> --credentials="/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/credentials"
>  --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/master"
>  --zk_session_timeout="10secs"
> I1212 17:11:12.394670 17362944 master.cpp:432] Master only allowing 
> authenticated frameworks to register
> I1212 17:11:12.394682 17362944 master.cpp:446] Master only allowing 
> authenticated agents to register
> I1212 17:11:12.394691 17362944 master.cpp:459] Master only allowing 
> authenticated HTTP frameworks to register
> I1212 17:11:12.394701 17362944 credentials.hpp:37] Loading credentials for 
> authentication from 
> '/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/credentials'
> I1212 17:11:12.394959 17362944 master.cpp:504] Using default 'crammd5' 
> authenticator
> I1212 17:11:12.394996 17362944 authenticator.cpp:519] Initializing server SASL
> I1212 17:11:12.411406 17362944 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1212 17:11:12.411571 17362944 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1212 17:11:12.411682 17362944 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1212 17:11:12.411775 17362944 master.cpp:584] Authorization enabled
> I1212 17:11:12.413318 16289792 master.cpp:2045] Elected as the leading master!
> I1212 17:11:12.413377 16289792 master.cpp:1568] Recovering from registrar
> I1212 17:11:12.417582 14143488 registrar.cpp:362] Successfully fetched the 
> registry (0B) in 4.131072ms
> I1212 17:11:12.417667 14143488 registrar.cpp:461] Applied 1 operations in 
> 27us; attempting to update the registry
> I1212 17:11:12.421799 14143488 registrar.cpp:506] Successfully updated the 
> registry in 4.10496ms
> I1212 17:11:12.421835 14143488 registrar.cpp:392] Successfully recovered 
> registrar
> I1212 17:11:12.421998 17362944 master.cpp:1684] Recovered 0 agents from the 
> registry (136B); allowing 10mins for agents to re-register
> I1212 17:11:12.422780 3971208128 containerizer.cpp:220] Using isolation: 
> 

[jira] [Assigned] (MESOS-7023) IOSwitchboardTest.RecoverThenKillSwitchboardContainerDestroyed is flaky

2018-12-27 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-7023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-7023:
--

Assignee: (was: Kevin Klues)

> IOSwitchboardTest.RecoverThenKillSwitchboardContainerDestroyed is flaky
> ---
>
> Key: MESOS-7023
> URL: https://issues.apache.org/jira/browse/MESOS-7023
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, test
>Affects Versions: 1.2.2
> Environment: ASF CI, cmake, gcc, Ubuntu 14.04, without libevent/SSL
>Reporter: Greg Mann
>Priority: Major
>  Labels: debugging, disabled-test, flaky
> Attachments: IOSwitchboardTest.RecoverThenKillSwitchboardContainerDestroyed.txt
>
>
> This was observed on ASF CI:
> {code}
> /mesos/src/tests/containerizer/io_switchboard_tests.cpp:1052: Failure
> Value of: statusFailed->reason()
>   Actual: 1
> Expected: TaskStatus::REASON_IO_SWITCHBOARD_EXITED
> Which is: 27
> {code}
> Find full log attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8252) MasterAuthorizationTest.SlaveRemovedLost is flaky.

2018-12-27 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-8252:
--

Assignee: (was: Alexander Rojas)

> MasterAuthorizationTest.SlaveRemovedLost is flaky.
> --
>
> Key: MESOS-8252
> URL: https://issues.apache.org/jira/browse/MESOS-8252
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Alexander Rukletsov
>Priority: Major
>  Labels: flaky-test
> Attachments: SlaveRemovedLost-badrun.txt
>
>
> Observed it in the internal CI today. Most likely related to the recent 
> introduction of the {{Abandoned}} future state. Full log attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9491) There exists no way to statically configure a weight for a Mesos role

2018-12-27 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729570#comment-16729570
 ] 

Alexander Rukletsov commented on MESOS-9491:


[~bbannier] Why do you think static configuration would be useful? We wanted to 
move away from the concept of statically defining roles in a cluster.

> There exists no way to statically configure a weight for a Mesos role
> -
>
> Key: MESOS-9491
> URL: https://issues.apache.org/jira/browse/MESOS-9491
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Benjamin Bannier
>Priority: Major
>
> While it is possible to change the weight of any role at runtime over the 
> operator API, it seems we currently have no supported way to configure this 
> statically with configuration flags. Both the {{--weights}} and {{--roles}} 
> flags would in principle allow this, but both are deprecated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9499) Mesos supports only digest authentication scheme for Zookeeper.

2018-12-24 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-9499:
--

Assignee: Dmitrii Kishchukov

> Mesos supports only digest authentication scheme for Zookeeper.
> ---
>
> Key: MESOS-9499
> URL: https://issues.apache.org/jira/browse/MESOS-9499
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.6.1, 1.7.0, 1.8.0
>Reporter: Alexander Rukletsov
>Assignee: Dmitrii Kishchukov
>Priority: Major
>  Labels: authentication, zookeeper
>
> Zookeeper has quite a flexible security model, of which Mesos supports digest 
> authentication only. This ticket aims to extend ZK authentication support in 
> Mesos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9499) Mesos supports only digest authentication scheme for Zookeeper.

2018-12-21 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-9499:
--

 Summary: Mesos supports only digest authentication scheme for 
Zookeeper.
 Key: MESOS-9499
 URL: https://issues.apache.org/jira/browse/MESOS-9499
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 1.7.0, 1.6.1, 1.8.0
Reporter: Alexander Rukletsov


Zookeeper has quite a flexible security model, of which Mesos supports digest 
authentication only. This ticket aims to extend ZK authentication support in 
Mesos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9419) Executor to framework message crashes master if framework has not re-registered.

2018-11-27 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16700415#comment-16700415
 ] 

Alexander Rukletsov commented on MESOS-9419:


I'd like to understand why the user has not observed the issue prior to 
{{1.5.x}}. [~chhsia0], when you say the issue "appears to be present there as 
well", does it mean you ran your test against {{1.0.x}}?

> Executor to framework message crashes master if framework has not 
> re-registered.
> 
>
> Key: MESOS-9419
> URL: https://issues.apache.org/jira/browse/MESOS-9419
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.5.1, 1.6.1, 1.7.0
>Reporter: Benjamin Mahler
>Assignee: Chun-Hung Hsiao
>Priority: Blocker
>
> If the executor sends a framework message after a master failover, and the 
> framework has not yet re-registered with the master, this will crash the 
> master:
> {code}
> W20181105 22:02:48.782819 172709 master.hpp:2304] Master attempted to send 
> message to disconnected framework 03dc2603-acd6-491e-8717-3f03e5ee37f4- 
> (Cook-1.24.0-9299b474217db499c9d28738050b359ac8dd55bb)
> F20181105 22:02:48.782830 172709 master.hpp:2314] CHECK_SOME(pid): is NONE
> *** Check failure stack trace: ***
> *** @ 0x7f09e016b6cd google::LogMessage::Fail()
> *** @ 0x7f09e016d38d google::LogMessage::SendToLog()
> *** @ 0x7f09e016b2b3 google::LogMessage::Flush()
> *** @ 0x7f09e016de09 google::LogMessageFatal::~LogMessageFatal()
> *** @ 0x7f09df086228 _CheckFatal::~_CheckFatal()
> *** @ 0x7f09df3a403d mesos::internal::master::Framework::send<>()
> *** @ 0x7f09df2f4886 mesos::internal::master::Master::executorMessage()
> *** @ 0x7f09df3b06a4 
> _ZN15ProtobufProcessIN5mesos8internal6master6MasterEE8handlerNINS1_26ExecutorToFrameworkMessageEJRKNS0\
>  
> _7SlaveIDERKNS0_11FrameworkIDERKNS0_10ExecutorIDERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcJS9_SC_SF_SN_EEEvPS3_MS3\
>  _FvRKN7process4UPIDEDpT1_ESS_SN_DpMT_KFT0_vE @ 0x7f09df345b43 
> std::_Function_handler<>::_M_invoke()
> *** @ 0x7f09df36930f ProtobufProcess<>::consume()
> *** @ 0x7f09df2e0ff5 mesos::internal::master::Master::_consume()
> *** @ 0x7f09df2f5542 mesos::internal::master::Master::consume()
> *** @ 0x7f09e00d9c7a process::ProcessManager::resume()
> *** @ 0x7f09e00dd836 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> *** @ 0x7f09dd467ac8 execute_native_thread_routine
> *** @ 0x7f09dd6f6b50 start_thread
> *** @ 0x7f09dcc7030d (unknown)
> {code}
> This is because Framework::send proceeds if the framework is disconnected. In 
> the case of a recovered framework, it will not have a pid or http connection 
> yet:
> https://github.com/apache/mesos/blob/9b889a10927b13510a1d02e7328925dba3438a0b/src/master/master.hpp#L2590-L2610
> {code}
> // Sends a message to the connected framework.
> template 
> void Framework::send(const Message& message)
> {
>   if (!connected()) {
> LOG(WARNING) << "Master attempted to send message to disconnected"
>  << " framework " << *this;
> // XXX proceeds!
>   }
>   metrics.incrementEvent(message);
>   if (http.isSome()) {
> if (!http->send(message)) {
>   LOG(WARNING) << "Unable to send event to framework " << *this << ":"
><< " connection closed";
> }
>   } else {
> CHECK_SOME(pid); // XXX Will crash.
> master->send(pid.get(), message);
>   }
> }
> {code}
> The executor to framework path does not guard against the framework being 
> disconnected, unlike the status update path:
> https://github.com/apache/mesos/blob/9b889a10927b13510a1d02e7328925dba3438a0b/src/master/master.cpp#L6472-L6495
> vs.
> https://github.com/apache/mesos/blob/9b889a10927b13510a1d02e7328925dba3438a0b/src/master/master.cpp#L8371-L8373
> It was reported that this crash didn't occur for the user on 1.2.0; however, 
> the issue appears to be present there as well, so we will try to backport a 
> test to see if it's indeed not occurring in 1.2.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-7991) fatal, check failed !framework->recovered()

2018-11-23 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-7991:
--

Assignee: (was: Alexander Rukletsov)

> fatal, check failed !framework->recovered()
> ---
>
> Key: MESOS-7991
> URL: https://issues.apache.org/jira/browse/MESOS-7991
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jack Crawford
>Priority: Critical
>  Labels: reliability
>
> mesos master crashed on what appears to be framework recovery
> mesos master version: 1.3.1
> mesos agent version: 1.3.1
> {code}
> W0920 14:58:54.756364 25452 master.cpp:7568] Task 
> 862181ec-dffb-4c03-8807-5fb4c4e9a907 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756369 25452 master.cpp:7568] Task 
> 9c21c48a-63ad-4d58-9e22-f720af19a644 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756376 25452 master.cpp:7568] Task 
> 05c451f8-c48a-47bd-a235-0ceb9b3f8d0c of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756381 25452 master.cpp:7568] Task 
> e8641b1f-f67f-42fe-821c-09e5a290fc60 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756386 25452 master.cpp:7568] Task 
> f838a03c-5cd4-47eb-8606-69b004d89808 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756392 25452 master.cpp:7568] Task 
> 685ca5da-fa24-494d-a806-06e03bbf00bd of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756397 25452 master.cpp:7568] Task 
> 65ccf39b-5c46-4121-9fdd-21570e8068e6 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> F0920 14:58:54.756404 25452 master.cpp:7601] Check failed: 
> !framework->recovered()
> *** Check failure stack trace: ***
> @ 0x7f7bf80087ed  google::LogMessage::Fail()
> @ 0x7f7bf800a5a0  google::LogMessage::SendToLog()
> @ 0x7f7bf80083d3  google::LogMessage::Flush()
> @ 0x7f7bf800afc9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f7bf736fe7e  
> mesos::internal::master::Master::reconcileKnownSlave()
> @ 0x7f7bf739e612  mesos::internal::master::Master::_reregisterSlave()
> @ 0x7f7bf73a580e  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERK6OptionINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIc
> RKSt6vectorINS5_8ResourceESaISQ_EERKSP_INS5_12ExecutorInfoESaISV_EERKSP_INS5_4TaskESaIS10_EERKSP_INS5_13FrameworkInfoESaIS15_EERKSP_INS6_17Archive_FrameworkESaIS1A_EERKSL_RKSP_INS5_20SlaveInfo_CapabilityESaIS
> 1H_EERKNS0_6FutureIbEES9_SC_SM_SS_SX_S12_S17_S1C_SL_S1J_S1N_EEvRKNS0_3PIDIT_EEMS1R_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_T10_ET11_T12_T13_T14_T15_T16_T17_T18_T19_T20_T21_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
> @ 0x7f7bf7f5e69c  process::ProcessBase::visit()
> @ 0x7f7bf7f71403  process::ProcessManager::resume()
> @ 0x7f7bf7f7c127  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f7bf60b5c80  (unknown)
> @ 0x7f7bf58c86ba  start_thread
> @ 0x7f7bf55fe3dd  (unknown)
> mesos-master.service: Main process exited, code=killed, status=6/ABRT
> mesos-master.service: Unit entered failed state.
> mesos-master.service: Failed with result 'signal'.
> {code}
> The issue happened again on Mesos 1.5 (docker mesos master from the 
> mesosphere docker repo):
> {code}
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81543313 
> http.cpp:1185] HTTP POST for /master/api/v1/scheduler from 10.142.0.5:55133
> Mar 11 10:04:33 research docker[4503]: I0311 

[jira] [Commented] (MESOS-7991) fatal, check failed !framework->recovered()

2018-11-23 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697136#comment-16697136
 ] 

Alexander Rukletsov commented on MESOS-7991:


An update from a user: "The failure in this case seems to happen right after an 
agent drops out of the cluster - which is a similar failure condition to the 
first time I encountered this".
{noformat}
Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81543313 
http.cpp:1185] HTTP POST for /master/api/v1/scheduler from 10.142.0.5:55133
Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81558813 
master.cpp:5467] Processing DECLINE call for offers: [ 
5e57f633-a69c-4009-b773-990b4b8984ad-O58323 ] for framework 
5e57f633-a69c-4009-b7
Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81569313 
master.cpp:10703] Removing offer 5e57f633-a69c-4009-b773-990b4b8984ad-O58323
Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82014210 
master.cpp:8227] Marking agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at 
slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engi
Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82036710 
registrar.cpp:495] Applied 1 operations in 86528ns; attempting to update the 
registry
Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82057210 
registrar.cpp:552] Successfully updated the registry in 175872ns
Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82064211 
master.cpp:8275] Marked agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at 
slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engin
Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820957 9 
hierarchical.cpp:609] Removed agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49
Mar 11 10:04:35 research docker[4503]: F0311 10:04:35.85196111 
master.cpp:10018] Check failed: 'framework' Must be non NULL
Mar 11 10:04:35 research docker[4503]: *** Check failure stack trace: ***
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044a7d  
google::LogMessage::Fail()
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6046830  
google::LogMessage::SendToLog()
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044663  
google::LogMessage::Flush()
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6047259  
google::LogMessageFatal::~LogMessageFatal()
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5258e14  
google::CheckNotNull<>()
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521dfc8  
mesos::internal::master::Master::__removeSlave()
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521f1a2  
mesos::internal::master::Master::_markUnreachable()
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5f98f11  
process::ProcessBase::consume()
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb2a4a  
process::ProcessManager::resume()
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb65d6  
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c35d4c80  (unknown)
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2de76ba  start_thread
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2b1d41d  (unknown)
Mar 11 10:04:36 research docker[4503]: *** Aborted at 1520762676 (unix time) 
try "date -d @1520762676" if you are using GNU date ***
Mar 11 10:04:36 research docker[4503]: PC: @ 0x7f96c2a4d196 (unknown)
Mar 11 10:04:36 research docker[4503]: *** SIGSEGV (@0x0) received by PID 1 
(TID 0x7f96b986d700) from PID 0; stack trace: ***
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2df1390 (unknown)
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2a4d196 (unknown)
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c604ce2c 
google::DumpStackTraceAndExit()
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044a7d 
google::LogMessage::Fail()
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6046830 
google::LogMessage::SendToLog()
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044663 
google::LogMessage::Flush()
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6047259 
google::LogMessageFatal::~LogMessageFatal()
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5258e14 
google::CheckNotNull<>()
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521dfc8 
mesos::internal::master::Master::__removeSlave()
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521f1a2 
mesos::internal::master::Master::_markUnreachable()
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5f98f11 
process::ProcessBase::consume()
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb2a4a 
process::ProcessManager::resume()
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb65d6 
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c35d4c80 (unknown)
Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2de76ba start_thread

[jira] [Assigned] (MESOS-7748) Slow subscribers of streaming APIs can lead to Mesos OOMing.

2018-11-23 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-7748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-7748:
--

Assignee: (was: Alexander Rukletsov)

> Slow subscribers of streaming APIs can lead to Mesos OOMing.
> 
>
> Key: MESOS-7748
> URL: https://issues.apache.org/jira/browse/MESOS-7748
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Priority: Critical
>  Labels: mesosphere, reliability
>
> For each active subscriber, Mesos master / slave maintains an event queue, 
> which grows over time if the subscriber does not read fast enough. As the 
> number of such "slow" subscribers grows, so does Mesos master / slave memory 
> consumption, which might lead to an OOM event.
> Ideas to consider:
> * Restrict the number of subscribers for the streaming APIs
> * Check (ping) for inactive or "slow" subscribers
> * Disconnect the subscriber when there are too many queued events in memory



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8975) Problem and solution overview for the slow API issue.

2018-11-20 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693568#comment-16693568
 ] 

Alexander Rukletsov edited comment on MESOS-8975 at 11/20/18 5:41 PM:
--

{noformat}
commit 40dc508d59d547e867746bc6b5b82ced849687f8
Author: Alexander Rukletsov 
AuthorDate: Sun Nov 18 05:09:39 2018 +0100
Commit: Alexander Rukletsov 
CommitDate: Tue Nov 20 18:37:42 2018 +0100

Added MasterActorResponsiveness_BENCHMARK_Test.

See summary.

Review: https://reviews.apache.org/r/68131/
{noformat}


was (Author: alexr):
{noformat}
Author: Alexander Rukletsov 
AuthorDate: Sun Nov 18 05:09:39 2018 +0100
Commit: Alexander Rukletsov 
CommitDate: Tue Nov 20 18:37:42 2018 +0100

Added MasterActorResponsiveness_BENCHMARK_Test.

See summary.

Review: https://reviews.apache.org/r/68131/
{noformat}

> Problem and solution overview for the slow API issue.
> -
>
> Key: MESOS-8975
> URL: https://issues.apache.org/jira/browse/MESOS-8975
> Project: Mesos
>  Issue Type: Task
>  Components: HTTP API
>Reporter: Alexander Rukletsov
>Assignee: Benno Evers
>Priority: Major
>  Labels: benchmark, performance
> Fix For: 1.8.0
>
>
> Collect data from the clusters regarding {{state.json}} responsiveness, 
> figure out, where the bottlenecks are, and prepare an overview of solutions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9395) Check failure on

2018-11-16 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-9395:
--

 Summary: Check failure on 
 Key: MESOS-9395
 URL: https://issues.apache.org/jira/browse/MESOS-9395
 Project: Mesos
  Issue Type: Bug
  Components: resource provider
Affects Versions: 1.7.0
Reporter: Alexander Rukletsov


Observed the following agent failure on one of our staging clusters:
{noformat}
Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re 
mesos-agent[26663]: I1116 11:57:24.641331 26684 http.cpp:1799] Processing 
GET_AGENT call
Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re 
mesos-agent[26663]: I1116 11:57:24.650429 26679 http.cpp:1117] HTTP POST for 
/slave(1)/api/v1/resource_provider from 172.31.8.65:57790
Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re 
mesos-agent[26663]: I1116 11:57:24.650629 26679 manager.cpp:672] Subscribing 
resource provider 
{"attributes":[{"name":"lvm-vg-name","text":{"value":"lvm-double-1540383639"},"type":"SCALAR"},{"name":"dss-asset-id","text":{"value":"6AbZV6W2DrK4YgcIR3ICVo"},"type":"SCALAR"}],"default_reservations":[{"principal":"storage-principal","role":"dcos-storage","type":"DYNAMIC"}],"id":{"value":"8326e931-41f2-4f45-9174-13fe35c19300"},"name":"rp_6AbZV6W2DrK4YgcIR3ICVo","storage":{"plugin":{"containers":[{"command":{"environment":{"variables":[{"name":"PATH","type":"VALUE","value":"/opt/mesosphere/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"},{"name":"LD_LIBRARY_PATH","type":"VALUE","value":"/opt/mesosphere/lib"},{"name":"CONTAINER_LOGGER_DESTINATION_TYPE","type":"VALUE","value":"journald+logrotate"},{"name":"CONTAINER_LOGGER_EXTRA_LABELS","type":"VALUE","value":"{\"CSI_PLUGIN\":\"csilvm\"}"}]},"shell":true,"uris":[{"executable":true,"extract":false,"value":""}],"value":"echo
 \"a *:* rwm\" > /sys/fs/cgroup/devices`cat /proc/self/cgroup | grep devices | 
cut -d : -f 3`/devices.allow; exec ./csilvm -devices=/dev/xvdk,/dev/xvdj 
-volume-group=lvm-double-1540383639 -unix-addr-env=CSI_ENDPOINT 
-tag=6AbZV6W2DrK4YgcIR3ICVo"},"resources":[{"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"name":"mem","scalar":{"value":128.0},"type":"SCALAR"},{"name":"disk","scalar":{"value":10.0},"type":"SCALAR"}],"services":["CONTROLLER_SERVICE","NODE_SERVICE"]}],"name":"plugin_6AbZV6W2DrK4YgcIR3ICVo","type":"io.mesosphere.dcos.storage.csilvm"}},"type":"org.apache.mesos.rp.local.storage"}
Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re 
mesos-agent[26663]: I1116 11:57:24.690474 26685 provider.cpp:546] Received 
SUBSCRIBED event
Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re 
mesos-agent[26663]: I1116 11:57:24.690521 26685 provider.cpp:1492] Subscribed 
with ID 8326e931-41f2-4f45-9174-13fe35c19300
Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re 
mesos-agent[26663]: I1116 11:57:24.690657 26681 
status_update_manager_process.hpp:314] Recovering operation status update 
manager
Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re 
mesos-agent[26663]: F1116 11:57:24.691496 26682 provider.cpp:3121] Check 
failed: resource.disk().source().has_profile() != 
resource.disk().source().has_id() (1 vs. 1)
Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re 
mesos-agent[26663]: *** Check failure stack trace: ***
Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re 
mesos-agent[26663]: @ 0x7fecb099e9fd  google::LogMessage::Fail()
Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re 
mesos-agent[26663]: @ 0x7fecb09a082d  google::LogMessage::SendToLog()
Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re 
mesos-agent[26663]: @ 0x7fecb099e5ec  google::LogMessage::Flush()
Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re 
mesos-agent[26663]: @ 0x7fecb09a1129  
google::LogMessageFatal::~LogMessageFatal()
Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re 
mesos-agent[26663]: @ 0x7fecb01654ca  
mesos::internal::StorageLocalResourceProviderProcess::applyCreateDisk()
Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re 
mesos-agent[26663]: @ 0x7fecb017c683  
mesos::internal::StorageLocalResourceProviderProcess::_applyOperation()
Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re 
mesos-agent[26663]: @ 0x7fecb017d64a  
_ZZN5mesos8internal35StorageLocalResourceProviderProcess26reconcileOperationStatusesEvENKUlRKNS0_26StatusUpdateManagerProcessIN2id4UUIDENS0_27UpdateOperationStatusRecordENS0_28UpdateOperationStatusMessageEE5StateEE_clESA_
Nov 16 11:57:24 int-mountvolumeagent2-soak112s.testing.mesosphe.re 
mesos-agent[26663]: @ 0x7fecb017dd21  

[jira] [Commented] (MESOS-8723) ROOT_HealthCheckUsingPersistentVolume is flaky.

2018-11-07 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678453#comment-16678453
 ] 

Alexander Rukletsov commented on MESOS-8723:


This ^ bad run is likely https://jira.apache.org/jira/browse/MESOS-8096

> ROOT_HealthCheckUsingPersistentVolume is flaky.
> ---
>
> Key: MESOS-8723
> URL: https://issues.apache.org/jira/browse/MESOS-8723
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.5.0
> Environment: ec2's CentOS 7 with SSL
>Reporter: Alexander Rukletsov
>Priority: Major
>  Labels: flaky-test
> Attachments: ROOT_HealthCheckUsingPersistentVolume-badrun.txt
>
>
> {noformat}
> ../../src/tests/cluster.cpp:660: Failure
> Failed to wait 15secs for destroy
> I0321 19:45:11.676262  8064 master.cpp:1137] Master terminating
> I0321 19:45:11.676625 27242 hierarchical.cpp:609] Removed agent 
> b7675b9a-d9e9-4c97-a5c2-d50fc6101301-S0
> {noformat}
> Full log attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8780) Expose Check and HealthCheck information on Mesos HTTP endpoints.

2018-10-22 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658741#comment-16658741
 ] 

Alexander Rukletsov commented on MESOS-8780:


Let's keep this one open: it's good to have checks and health checks as much in 
sync as possible.

> Expose Check and HealthCheck information on Mesos HTTP endpoints.
> -
>
> Key: MESOS-8780
> URL: https://issues.apache.org/jira/browse/MESOS-8780
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Adam Medziński
>Assignee: Greg Mann
>Priority: Minor
>  Labels: api, integration, mesosphere
>
> Is the information about task health check definition not exposed on Mesos 
> HTTP endpoints ({{/master/tasks}} or {{/slave/state}} ) for some specific 
> reason? I'm working on integration with Hashicorp Consul and it would allow 
> me to synchronize the definitions of health checks only by using HTTP API. If 
> this information is not exposed by accident, I will gladly make a pull 
> request.
> This is related to both {{HealthCheck}} and {{CheckInfo}} in both {{v0}} and 
> {{v1}} APIs.





[jira] [Assigned] (MESOS-6417) Introduce an extra 'unknown' health check state.

2018-10-16 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-6417:
--

Shepherd: Alexander Rukletsov
Assignee: Greg Mann  (was: Alexander Rukletsov)
  Sprint: Mesosphere RI-6 Sprint 2018-31
Story Points: 5

> Introduce an extra 'unknown' health check state.
> 
>
> Key: MESOS-6417
> URL: https://issues.apache.org/jira/browse/MESOS-6417
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Alexander Rukletsov
>Assignee: Greg Mann
>Priority: Major
>  Labels: health-check, mesosphere
>
> There are three logical states regarding health checks:
> 1) no health checks;
> 2) a health check is defined, but no result is available yet;
> 3) a health check is defined, it is either healthy or not.
> Currently, we do not distinguish between 1) and 2), which can be problematic 
> for framework authors.





[jira] [Assigned] (MESOS-8780) Expose Check and HealthCheck information on Mesos HTTP endpoints.

2018-10-16 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-8780:
--

Shepherd: Greg Mann
Assignee: Alexander Rukletsov
  Sprint: Mesosphere RI-6 Sprint 2018-31
Story Points: 5
  Labels: api integration mesosphere  (was: )
 Description: 
Is the information about task health check definition not exposed on Mesos HTTP 
endpoints ({{/master/tasks}} or {{/slave/state}} ) for some specific reason? 
I'm working on integration with Hashicorp Consul and it would allow me to 
synchronize the definitions of health checks only by using HTTP API. If this 
information is not exposed by accident, I will gladly make a pull request.

This is related to both {{HealthCheck}} and {{CheckInfo}} in both {{v0}} and 
{{v1}} APIs.

  was:Is the information about task health check definition not exposed on 
Mesos HTTP endpoints ({{/master/tasks}} or {{/slave/state}} ) for some specific 
reason? I'm working on integration with Hashicorp Consul and it would allow me 
to synchronize the definitions of health checks only by using HTTP API. If this 
information is not exposed by accident, I will gladly make a pull request.

 Component/s: HTTP API
  Issue Type: Improvement  (was: Story)
 Summary: Expose Check and HealthCheck information on Mesos HTTP 
endpoints.  (was: Expose HealthCheck information on Mesos HTTP endpoints)

> Expose Check and HealthCheck information on Mesos HTTP endpoints.
> -
>
> Key: MESOS-8780
> URL: https://issues.apache.org/jira/browse/MESOS-8780
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Adam Medziński
>Assignee: Alexander Rukletsov
>Priority: Minor
>  Labels: api, integration, mesosphere
>
> Is the information about task health check definition not exposed on Mesos 
> HTTP endpoints ({{/master/tasks}} or {{/slave/state}} ) for some specific 
> reason? I'm working on integration with Hashicorp Consul and it would allow 
> me to synchronize the definitions of health checks only by using HTTP API. If 
> this information is not exposed by accident, I will gladly make a pull 
> request.
> This is related to both {{HealthCheck}} and {{CheckInfo}} in both {{v0}} and 
> {{v1}} APIs.





[jira] [Created] (MESOS-9317) Some master endpoints do not handle failed authorization properly.

2018-10-15 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-9317:
--

 Summary: Some master endpoints do not handle failed authorization 
properly.
 Key: MESOS-9317
 URL: https://issues.apache.org/jira/browse/MESOS-9317
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 1.7.0, 1.6.1, 1.5.1
Reporter: Alexander Rukletsov


When we authorize _some_ actions (right now I see this happening to create / 
destroy volumes, reserve / unreserve resources) *and* {{authorizer}} fails 
(i.e. returns the future in non-ready state), an assertion is triggered:
{noformat}
mesos-master[49173]: F1015 11:40:29.795748 49396 future.hpp:1306] Check failed: 
!isFailed() Future::get() but state == FAILED: Failed to retrieve permissions 
from IAM at url 
https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions the request 
failed: Failed to contact bouncer at 
https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to time 
out after 3 attempts
{noformat}
This is due to incorrect assumption in our code, see for example 
[https://github.com/apache/mesos/blob/a063afce9868dcee38a0ab7efaa028244f3999cf/src/master/master.cpp#L3752-L3763]:
{noformat}
  return await(authorizations)
    .then([](const vector<Future<bool>>& authorizations)
        -> Future<bool> {
      // Compute a disjunction.
      foreach (const Future<bool>& authorization, authorizations) {
        if (!authorization.get()) {
          return false;
        }
      }
      return true;
    });
{noformat}
Futures returned from {{await}} are guaranteed to be in terminal state, but not 
necessarily ready! In the snippet above, {{!authorization.get()}} is invoked 
without being checked ⇒ assertion fails.

Full stack trace:
{noformat}
Oct 15 11:40:39 int-master2-mwst9.scaletesting.mesosphe.re mesos-master[49173]: 
F1015 11:40:29.795748 49396 future.hpp:1306] Check failed: !isFailed() 
Future::get() but state == FAILED: Failed to retrieve permissions from IAM at 
url https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions the 
request failed: Failed to contact bouncer at 
https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to time 
out after 3 attemptsF1015 11:40:29.796037 49395 future.hpp:1306] Check failed: 
!isFailed() Future::get() but state == FAILED: Failed to retrieve permissions 
from IAM at url 
https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions the request 
failed: Failed to contact bouncer at 
https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to time 
out after 3 attemptsF1015 11:40:29.796097 49384 future.hpp:1306] Check failed: 
!isFailed() Future::get() but state == FAILED: Failed to retrieve permissions 
from IAM at url 
https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions the request 
failed: Failed to contact bouncer at 
https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to time 
out after 3 attemptsF1015 11:40:29.796249 49393 future.hpp:1306] Check failed: 
!isFailed() Future::get() but state == FAILED: Failed to retrieve permissions 
from IAM at url 
https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions the request 
failed: Failed to contact bouncer at 
https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to time 
out after 3 attemptsF1015 11:40:29.796375 49390 future.hpp:1306] Check failed: 
!isFailed() Future::get() but state == FAILED: Failed to retrieve permissions 
from IAM at url 
https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions the request 
failed: Failed to contact bouncer at 
https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to time 
out after 3 attemptsF1015 11:40:29.796483 49388 future.hpp:1306] Check failed: 
!isFailed() Future::get() but state == FAILED: Failed to retrieve permissions 
from IAM at url 
https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions the request 
failed: Failed to contact bouncer at 
https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to time 
out after 3 attemptsF1015 11:40:29.796629 49381 future.hpp:1306] Check failed: 
!isFailed() Future::get() but state == FAILED: Failed to retrieve permissions 
from IAM at url 
https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions the request 
failed: Failed to contact bouncer at 
https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to time 
out after 3 attemptsF1015 11:40:29.796700 49385 future.hpp:1306] Check failed: 
!isFailed() Future::get() but state == FAILED: Failed to retrieve permissions 
from IAM at url 
https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions the request 
failed: Failed to contact bouncer at 
https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to time 
out after 3 attemptsF1015 11:40:29.796780 49386 future.hpp:1306] Check failed: 

[jira] [Commented] (MESOS-9277) UNRESERVE scheduler call can be dropped if it loses the race with TEARDOWN.

2018-10-14 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649340#comment-16649340
 ] 

Alexander Rukletsov commented on MESOS-9277:


IIUC, the scheduler API is a stream, hence using {{process::Sequence}} should be 
sufficient.

> UNRESERVE scheduler call can be dropped if it loses the race with TEARDOWN. 
> 
>
> Key: MESOS-9277
> URL: https://issues.apache.org/jira/browse/MESOS-9277
> Project: Mesos
>  Issue Type: Bug
>  Components: scheduler api
>Affects Versions: 1.5.1, 1.6.1, 1.7.0
>Reporter: Alexander Rukletsov
>Priority: Major
>  Labels: mesosphere, v1_api
>
> A typical use pattern for a framework scheduler is to remove its reservations 
> before tearing itself down. However, it is racy: {{UNRESERVE}} is a 
> multi-stage action which aborts if the framework is removed in-between.
> *Solution 1*
> Let schedulers use operation feedback and expect them to wait for an ack for 
> {{UNRESERVE}} before they send {{TEARDOWN}}. Kind of science fiction with a 
> timeline of {{O(months)}} and still possibilities for the race if a scheduler 
> does not comply.
> *Solution 2*
> Serialize calls for schedulers. For example, we can chain [handlers 
> here|https://github.com/apache/mesos/blob/6e21e94ddca5b776d44636fe3eba8500bf88dc25/src/master/http.cpp#L640-L711]
>  onto per-{{Master::Framework}} 
> [{{process::Sequence}}|https://github.com/apache/mesos/blob/6e21e94ddca5b776d44636fe3eba8500bf88dc25/3rdparty/libprocess/include/process/sequence.hpp].
>  For that however, handlers must provide futures indicating when the 
> processing of the call is finished, note that most [handlers 
> here|https://github.com/apache/mesos/blob/6e21e94ddca5b776d44636fe3eba8500bf88dc25/src/master/http.cpp#L640-L711]
>  return void.





[jira] [Assigned] (MESOS-7693) DEBUG container does not inherit env variable properly for command tasks.

2018-10-13 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-7693:
--

Assignee: (was: Alexander Rukletsov)

> DEBUG container does not inherit env variable properly for command tasks.
> -
>
> Key: MESOS-7693
> URL: https://issues.apache.org/jira/browse/MESOS-7693
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.3.0
>Reporter: Jie Yu
>Priority: Major
>
> I can repo the issue:
> {code}
> sudo /home/vagrant/workspace/dist/mesos-1.4.0/bin/mesos-execute 
> --master=172.28.128.3:5050 --name=java8 --docker_image=java:8 
> --command="sleep 1000"
> I0618 17:42:21.410598  3356 scheduler.cpp:184] Version: 1.4.0
> I0618 17:42:21.413465  3356 scheduler.cpp:470] New master detected at 
> master@172.28.128.3:5050
> Subscribed with ID cacf5c08-cbbc-401a-a84d-2cfc4edc6519-0006
> Submitted task 'java8' to agent 'cacf5c08-cbbc-401a-a84d-2cfc4edc6519-S0'
> Received status update TASK_RUNNING for task 'java8'
>   source: SOURCE_EXECUTOR
> Jies-MacBook-Pro:script jie$ ./dcos task
> NAME   HOST  USER  STATE  ID
> java8  172.28.128.3  rootRjava8
> Jies-MacBook-Pro:script jie$ ./dcos task exec -t -i java8 bash
> root@vagrant-ubuntu-trusty-64:/mnt/mesos/sandbox# env
> LIBPROCESS_IP=172.28.128.3
> MESOS_AGENT_ENDPOINT=172.28.128.3:5051
> MESOS_DIRECTORY=/tmp/mesos/slave/slaves/cacf5c08-cbbc-401a-a84d-2cfc4edc6519-S0/frameworks/cacf5c08-cbbc-401a-a84d-2cfc4edc6519-0006/executors/java8/runs/1b06c661-20f3-460a-8cfd-475dc3e60aa3
> MESOS_EXECUTOR_ID=java8
> PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
> PWD=/mnt/mesos/sandbox
> MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD=5secs
> MESOS_NATIVE_JAVA_LIBRARY=/home/vagrant/workspace/dist/mesos-1.4.0/lib/libmesos-1.4.0.so
> MESOS_NATIVE_LIBRARY=/home/vagrant/workspace/dist/mesos-1.4.0/lib/libmesos-1.4.0.so
> MESOS_HTTP_COMMAND_EXECUTOR=0
> MESOS_SLAVE_PID=slave(1)@172.28.128.3:5051
> MESOS_FRAMEWORK_ID=cacf5c08-cbbc-401a-a84d-2cfc4edc6519-0006
> MESOS_CHECKPOINT=0
> SHLVL=1
> LIBPROCESS_PORT=0
> MESOS_SLAVE_ID=cacf5c08-cbbc-401a-a84d-2cfc4edc6519-S0
> MESOS_SANDBOX=/mnt/mesos/sandbox
> _=/usr/bin/env
> {code}
> As you can see, environment variables like JAVA_HOME defined in the docker 
> image are not in the debug container.





[jira] [Commented] (MESOS-8907) curl fetcher fails with HTTP/2

2018-10-10 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645579#comment-16645579
 ] 

Alexander Rukletsov commented on MESOS-8907:


[~tillt] the fix sounds reasonable to me; however, I'd like to confirm first 
that the version of curl used in Ubuntu 18 started using HTTP/2 by default, 
as was the case for Ubuntu 16.

> curl fetcher fails with HTTP/2
> --
>
> Key: MESOS-8907
> URL: https://issues.apache.org/jira/browse/MESOS-8907
> Project: Mesos
>  Issue Type: Task
>  Components: fetcher
>Reporter: James Peach
>Priority: Major
>  Labels: integration
>
> {noformat}
> [ RUN  ] 
> ImageAlpine/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/2
> ...
> I0510 20:52:00.209815 25010 registry_puller.cpp:287] Pulling image 
> 'quay.io/coreos/alpine-sh' from 
> 'docker-manifest://quay.iocoreos/alpine-sh?latest#https' to 
> '/tmp/ImageAlpine_ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_2_wF7EfM/store/docker/staging/qit1Jn'
> E0510 20:52:00.756072 25003 slave.cpp:6176] Container 
> '5eb869c5-555c-4dc9-a6ce-ddc2e7dbd01a' for executor 
> 'ad9aa898-026e-47d8-bac6-0ff993ec5904' of framework 
> 7dbe7cd6-8ffe-4bcf-986a-17ba677b5a69- failed to start: Failed to decode 
> HTTP responses: Decoding failed
> HTTP/2 200
> server: nginx/1.13.12
> date: Fri, 11 May 2018 03:52:00 GMT
> content-type: application/vnd.docker.distribution.manifest.v1+prettyjws
> content-length: 4486
> docker-content-digest: 
> sha256:61bd5317a92c3213cfe70e2b629098c51c50728ef48ff984ce929983889ed663
> x-frame-options: DENY
> strict-transport-security: max-age=63072000; preload
> ...
> {noformat}
> Note that curl is saying the HTTP version is "HTTP/2". This happens on modern 
> curl that automatically negotiates HTTP/2, but the docker fetcher isn't 
> prepared to parse that.
> {noformat}
> $ curl -i --raw -L -s -S -o -  'http://quay.io/coreos/alpine-sh?latest#https'
> HTTP/1.1 301 Moved Permanently
> Content-Type: text/html
> Date: Fri, 11 May 2018 04:07:44 GMT
> Location: https://quay.io/coreos/alpine-sh?latest
> Server: nginx/1.13.12
> Content-Length: 186
> Connection: keep-alive
> HTTP/2 301
> server: nginx/1.13.12
> date: Fri, 11 May 2018 04:07:45 GMT
> content-type: text/html; charset=utf-8
> content-length: 287
> location: https://quay.io/coreos/alpine-sh/?latest
> x-frame-options: DENY
> strict-transport-security: max-age=63072000; preload
> {noformat}





[jira] [Assigned] (MESOS-8999) Add default bodies for libprocess HTTP error responses.

2018-10-07 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-8999:
--

Shepherd: Alexander Rukletsov
Assignee: Benno Evers
  Sprint: Mesosphere RI-6 Sprint 2018-30
Story Points: 3
  Labels: mesosphere observability  (was: )
 Component/s: libprocess

> Add default bodies for libprocess HTTP error responses.
> ---
>
> Key: MESOS-8999
> URL: https://issues.apache.org/jira/browse/MESOS-8999
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>  Labels: mesosphere, observability
>
> By default on error libprocess would only return a response
> with the correct status code and no response body.
> However, most browsers do not visually indicate the response
> status code, so if any error occurs anyone using a browser will only
> see a blank page, making it hard to figure out what happened.





[jira] [Created] (MESOS-9298) Task failures sometimes can't be understood without looking into agent logs.

2018-10-07 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-9298:
--

 Summary: Task failures sometimes can't be understood without 
looking into agent logs.
 Key: MESOS-9298
 URL: https://issues.apache.org/jira/browse/MESOS-9298
 Project: Mesos
  Issue Type: Epic
  Components: scheduler api
Reporter: Alexander Rukletsov


Mesos communicates task state transitions via task status updates. They often 
include a reason, which aims to hint what exactly went wrong. However, these 
reasons are often:
- misleading
- vague
- generic.
Needless to say, this complicates triaging why the task has actually failed and 
hence is a bad user experience. The failures can come from a bunch of different 
sources: fetcher, isolators (including custom ones!), namespace setup, etc.

This epic aims to improve the UX by providing detailed, ideally typed, 
information about task failures.





[jira] [Commented] (MESOS-9274) v1 JAVA scheduler library can drop TEARDOWN upon destruction.

2018-10-05 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16639361#comment-16639361
 ] 

Alexander Rukletsov commented on MESOS-9274:


*Backports to 1.7.1:*
{noformat}
830a7d53218ae472d10cf5733dab2c13600638b2
f8ba9e3f4fb1bb8fe7d0e35bd3d92696cb8381a7
{noformat}
*Backports to 1.6.2:*
{noformat}
6ec452b7ecaae63a1eb79416b58ac5916c3fff6c
e26e907ff72670877af6b7868634df335d04006d
{noformat}
These patches can't be backported to 1.5.x because [the scheduler 
library|https://github.com/apache/mesos/blob/ba960ed45e80119eadf398abd72609538fbc983e/include/mesos/v1/scheduler.hpp#L65]
 does not provide {{call()}} method there, which [was 
introduced|https://github.com/apache/mesos/commit/c39ef69514e57ca7c90e764a4a617abf88cd144f#diff-008387c75189aa7afcf0726f8d22530b]
 in Mesos 1.6.0.

> v1 JAVA scheduler library can drop TEARDOWN upon destruction.
> -
>
> Key: MESOS-9274
> URL: https://issues.apache.org/jira/browse/MESOS-9274
> Project: Mesos
>  Issue Type: Bug
>  Components: java api, scheduler driver
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>Priority: Major
>  Labels: api, mesosphere, scheduler
> Fix For: 1.6.2, 1.7.1, 1.8.0
>
>
> Currently the v1 JAVA scheduler library neither ensures {{Call}} s are sent 
> to the master nor waits for responses. This can be problematic if the library 
> is destroyed (or garbage collected) right after sending a {{TEARDOWN}} call: 
> destruction of the underlying {{Mesos}} actor races with sending the call.





[jira] [Comment Edited] (MESOS-9274) v1 JAVA scheduler library can drop TEARDOWN upon destruction.

2018-10-04 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630527#comment-16630527
 ] 

Alexander Rukletsov edited comment on MESOS-9274 at 10/4/18 9:01 AM:
-

I see several possible solutions here:
* Ensure the JAVA scheduler library is not destructed after {{TEARDOWN}} is 
sent. This is out of our control hence does not seem like a good solution or 
user experience
* Add {{sleep(5)}} in 
[{{V1Mesos::finalize()}}|https://github.com/apache/mesos/blob/270c4cb62f5680bcf952bfb7ec8dfc10843f21e0/src/java/jni/org_apache_mesos_v1_scheduler_V1Mesos.cpp#L258].
 This is a hacky solution but it [_follows the 
pattern_|https://github.com/apache/mesos/blob/86653356d763fee79e9467cf7b07bebb449e8aff/src/launcher/default_executor.cpp#L1082]
 ;).
* Use {{Mesos::call()}} instead of {{Mesos::send()}} and wait for the response 
in {{v1Mesos::send()}}. This seems like the cleanest solution.


was (Author: alexr):
I see several possible solutions here:
* Ensure the JAVA scheduler library is not destructed after {{TEARDOWN}} is 
sent. This is out of our control hence does not seem like a good solution or 
user experience
* Add {{sleep(5)}} in 
[{{V1Mesos::finalize()}}|https://github.com/apache/mesos/blob/270c4cb62f5680bcf952bfb7ec8dfc10843f21e0/src/java/jni/org_apache_mesos_v1_scheduler_V1Mesos.cpp#L258].
 This is a hacky solution but it [_follows the 
pattern_|https://github.com/apache/mesos/blob/86653356d763fee79e9467cf7b07bebb449e8aff/src/launcher/default_executor.cpp#L1082]
 ;).
* Use {[Mesos::call()}} instead of {{Mesos::send()}} and wait for the response 
in {{v1Mesos::send()}}. This seems like the cleanest solution.

> v1 JAVA scheduler library can drop TEARDOWN upon destruction.
> -
>
> Key: MESOS-9274
> URL: https://issues.apache.org/jira/browse/MESOS-9274
> Project: Mesos
>  Issue Type: Bug
>  Components: java api, scheduler driver
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>Priority: Major
>  Labels: api, mesosphere, scheduler
>
> Currently the v1 JAVA scheduler library neither ensures {{Call}} s are sent 
> to the master nor waits for responses. This can be problematic if the library 
> is destroyed (or garbage collected) right after sending a {{TEARDOWN}} call: 
> destruction of the underlying {{Mesos}} actor races with sending the call.





[jira] [Comment Edited] (MESOS-9116) Launch nested container session fails due to incorrect detection of `mnt` namespace of command executor's task.

2018-09-28 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16586276#comment-16586276
 ] 

Alexander Rukletsov edited comment on MESOS-9116 at 9/28/18 2:31 PM:
-

Backports to 1.6.x:
{noformat}
cfba574408a85861d424a2c58d3d7277490c398e
6d884fbf9be169fd97483a1f341540c5354d88a9
a4409826deada53eef8843df1a0178e9edfa4c9c
20a4d4fae2f30f9e5436a154087c1a1bb9dc0629
{noformat}
Backports to 1.5.x:
{noformat}
6dd3fcc8ab2aecd182fff29deac07b32b3cc2d81
edeac7b0da5dd7ee1e4e50320d964eb84220d87d
966574a31a3f8c5d4f9a5f02eeb1644aff7fdc97
e4d8ab9911af6d494aae7f5762dd84b8f085fd1e
{noformat}


was (Author: alexr):
Backports to 1.6.x:
{noformat}
cfba574408a85861d424a2c58d3d7277490c398e
6d884fbf9be169fd97483a1f341540c5354d88a9
a4409826deada53eef8843df1a0178e9edfa4c9c
20a4d4fae2f30f9e5436a154087c1a1bb9dc0629
{noformat}
Backports to 1.5.x:
{noformat}
6dd3fcc8ab2aecd182fff29deac07b32b3cc2d81
edeac7b0da5dd7ee1e4e50320d964eb84220d87d
966574a31a3f8c5d4f9a5f02eeb1644aff7fdc97
e4d8ab9911af6d494aae7f5762dd84b8f085fd1e
{noformat}
Backports to 1.4.x (partial):
{noformat}
c37eb59e4c4b7b6c16509f317c78207da6eeb485
{noformat}

> Launch nested container session fails due to incorrect detection of `mnt` 
> namespace of command executor's task.
> ---
>
> Key: MESOS-9116
> URL: https://issues.apache.org/jira/browse/MESOS-9116
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Affects Versions: 1.4.2, 1.5.1, 1.6.1, 1.7.0
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Critical
>  Labels: mesosphere
> Fix For: 1.5.2, 1.6.2, 1.7.0
>
> Attachments: pstree.png
>
>
> Launch nested container call might fail with the following error:
> {code:java}
> Failed to enter mount namespace: Failed to open '/proc/29473/ns/mnt': No such 
> file or directory
> {code}
> This happens when the containerizer launcher [tries to 
> enter|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/launch.cpp#L879-L892]
>  `mnt` namespace using the pid of a terminated process. The pid [was 
> detected|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/containerizer.cpp#L1930-L1958]
>  by the agent before spawning the containerizer launcher process, because the 
> process was running back then.
> The issue can be reproduced using the following test (pseudocode):
> {code:java}
> launchTask("sleep 1000")
> parentContainerId = containerizer.containers().begin()
> outputs = []
> for i in range(10):
>   ContainerId containerId
>   containerId.parent = parentContainerId
>   containerId.id = UUID.random()
>   LAUNCH_NESTED_CONTAINER_SESSION(containerId, "echo echo")
>   response = ATTACH_CONTAINER_OUTPUT(containerId)
>   outputs.append(response.reader)
> for output in outputs:
>   stdout, stderr = getProcessIOData(output)
>   assert("echo" == stdout + stderr){code}
> When we start the very first nested container, `getMountNamespaceTarget()` 
> returns a PID of the task (`sleep 1000`), because it's the only process whose 
> `mnt` namespace differs from the parent container. This nested container 
> becomes a child of PID 1 process, which is also a parent of the command 
> executor. It's not an executor's child! It can be seen in attached 
> `pstree.png`.
> When we start a second nested container, `getMountNamespaceTarget()` might 
> return PID of the previous nested container (`echo echo`) instead of the 
> task's PID (`sleep 1000`). It happens because the first nested container 
> entered `mnt` namespace of the task. Then, the containerizer launcher 
> ("nanny" process) attempts to enter `mnt` namespace using the PID of a 
> terminated process, so we get this error.





[jira] [Created] (MESOS-9277) UNRESERVE scheduler call can be dropped if it loses the race with TEARDOWN.

2018-09-28 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-9277:
--

 Summary: UNRESERVE scheduler call can be dropped if it loses the race 
with TEARDOWN. 
 Key: MESOS-9277
 URL: https://issues.apache.org/jira/browse/MESOS-9277
 Project: Mesos
  Issue Type: Bug
  Components: scheduler api
Affects Versions: 1.7.0, 1.6.1, 1.5.1
Reporter: Alexander Rukletsov


A typical use pattern for a framework scheduler is to remove its reservations 
before tearing itself down. However, it is racy: {{UNRESERVE}} is a multi-stage 
action which aborts if the framework is removed in-between.

*Solution 1*
Let schedulers use operation feedback and expect them to wait for an ack for 
{{UNRESERVE}} before they send {{TEARDOWN}}. Kind of science fiction with a 
timeline of {{O(months)}} and still possibilities for the race if a scheduler 
does not comply.

*Solution 2*
Serialize calls for schedulers. For example, we can chain [handlers 
here|https://github.com/apache/mesos/blob/6e21e94ddca5b776d44636fe3eba8500bf88dc25/src/master/http.cpp#L640-L711]
 onto per-{{Master::Framework}} 
[{{process::Sequence}}|https://github.com/apache/mesos/blob/6e21e94ddca5b776d44636fe3eba8500bf88dc25/3rdparty/libprocess/include/process/sequence.hpp].
 For that however, handlers must provide futures indicating when the processing 
of the call is finished, note that most [handlers 
here|https://github.com/apache/mesos/blob/6e21e94ddca5b776d44636fe3eba8500bf88dc25/src/master/http.cpp#L640-L711]
 return void.





[jira] [Commented] (MESOS-9274) v1 JAVA scheduler library can drop TEARDOWN upon destruction.

2018-09-27 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630527#comment-16630527
 ] 

Alexander Rukletsov commented on MESOS-9274:


I see several possible solutions here:
* Ensure the JAVA scheduler library is not destructed after {{TEARDOWN}} is 
sent. This is out of our control hence does not seem like a good solution or 
user experience
* Add {{sleep(5)}} in 
[{{V1Mesos::finalize()}}|https://github.com/apache/mesos/blob/270c4cb62f5680bcf952bfb7ec8dfc10843f21e0/src/java/jni/org_apache_mesos_v1_scheduler_V1Mesos.cpp#L258].
 This is a hacky solution but it [_follows the 
pattern_|https://github.com/apache/mesos/blob/86653356d763fee79e9467cf7b07bebb449e8aff/src/launcher/default_executor.cpp#L1082]
 ;).
* Use {{Mesos::call()}} instead of {{Mesos::send()}} and wait for the response 
in {{v1Mesos::send()}}. This seems like the cleanest solution.

> v1 JAVA scheduler library can drop TEARDOWN upon destruction.
> -
>
> Key: MESOS-9274
> URL: https://issues.apache.org/jira/browse/MESOS-9274
> Project: Mesos
>  Issue Type: Bug
>  Components: java api, scheduler driver
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>Priority: Major
>  Labels: api, mesosphere, scheduler
>
> Currently the v1 JAVA scheduler library neither ensures {{Call}} s are sent 
> to the master nor waits for responses. This can be problematic if the library 
> is destroyed (or garbage collected) right after sending a {{TEARDOWN}} call: 
> destruction of the underlying {{Mesos}} actor races with sending the call.





[jira] [Created] (MESOS-9274) v1 JAVA scheduler library can drop TEARDOWN upon destruction.

2018-09-27 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-9274:
--

 Summary: v1 JAVA scheduler library can drop TEARDOWN upon 
destruction.
 Key: MESOS-9274
 URL: https://issues.apache.org/jira/browse/MESOS-9274
 Project: Mesos
  Issue Type: Bug
  Components: java api, scheduler driver
Reporter: Alexander Rukletsov
Assignee: Alexander Rukletsov


Currently the v1 JAVA scheduler library neither ensures {{Call}} s are sent to 
the master nor waits for responses. This can be problematic if the library is 
destroyed (or garbage collected) right after sending a {{TEARDOWN}} call: 
destruction of the underlying {{Mesos}} actor races with sending the call.







[jira] [Commented] (MESOS-9257) AgentAPITest.LaunchNestedContainerSessionsInParallel is flaky

2018-09-26 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629460#comment-16629460
 ] 

Alexander Rukletsov commented on MESOS-9257:


Disabled this test for now in {{af5af29ce217f63aeec59bed81f2a742d2c5602a}}.

> AgentAPITest.LaunchNestedContainerSessionsInParallel is flaky
> -
>
> Key: MESOS-9257
> URL: https://issues.apache.org/jira/browse/MESOS-9257
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
> Environment: Debian \{8, 9} SSL
>Reporter: Andrei Budnik
>Priority: Major
>  Labels: flaky-test
> Attachments: LaunchNestedContainerSessionsInParallel-badrun.txt
>
>
> {code:java}
> ../../src/tests/api_tests.cpp:6641: Failure
> Expected: "echo\n"
> To be equal to: stdoutReceived + stderrReceived
> Which is: "
> {code}





[jira] [Commented] (MESOS-9261) PersistentVolumeTest.ShrinkVolume is flaky

2018-09-26 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628550#comment-16628550
 ] 

Alexander Rukletsov commented on MESOS-9261:


I don't see anything in the log that hints at why the task {{test `cat path1/file` 
= abc}} failed.

> PersistentVolumeTest.ShrinkVolume is flaky
> --
>
> Key: MESOS-9261
> URL: https://issues.apache.org/jira/browse/MESOS-9261
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: flaky-test
>
> Observed in an internal CI run:
> {noformat}
> ../../src/tests/persistent_volume_tests.cpp:832
>   Expected: TASK_FINISHED
> To be equal to: taskFinished->state()
>   Which is: TASK_FAILED
> {noformat}
> Full log:
> {noformat}
> [ RUN  ] DiskResource/PersistentVolumeTest.ShrinkVolume/0
> I0925 23:58:13.544659 21740 cluster.cpp:173] Creating default 'local' 
> authorizer
> I0925 23:58:13.545785  9453 master.cpp:413] Master 
> 9f8d4b56-de4c-4df6-86d9-92a6c3c9e432 (ip-172-16-10-34.ec2.internal) started 
> on 172.16.10.34:35358
> I0925 23:58:13.545801  9453 master.cpp:416] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="hierarchical" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/tf2SmN/credentials" --filter_gpu_resources="true" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
> --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/tf2SmN/master" --zk_session_timeout="10secs"
> I0925 23:58:13.545931  9453 master.cpp:465] Master only allowing 
> authenticated frameworks to register
> I0925 23:58:13.545939  9453 master.cpp:471] Master only allowing 
> authenticated agents to register
> I0925 23:58:13.545945  9453 master.cpp:477] Master only allowing 
> authenticated HTTP frameworks to register
> I0925 23:58:13.545951  9453 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/tf2SmN/credentials'
> I0925 23:58:13.546041  9453 master.cpp:521] Using default 'crammd5' 
> authenticator
> I0925 23:58:13.546085  9453 http.cpp:1037] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0925 23:58:13.546119  9453 http.cpp:1037] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0925 23:58:13.546149  9453 http.cpp:1037] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0925 23:58:13.546174  9453 master.cpp:602] Authorization enabled
> I0925 23:58:13.546268  9457 hierarchical.cpp:182] Initialized hierarchical 
> allocator process
> I0925 23:58:13.546294  9457 whitelist_watcher.cpp:77] No whitelist given
> I0925 23:58:13.546878  9458 master.cpp:2083] Elected as the leading master!
> I0925 23:58:13.546891  9458 master.cpp:1638] Recovering from registrar
> I0925 23:58:13.546941  9453 registrar.cpp:339] Recovering registrar
> I0925 23:58:13.547065  9453 registrar.cpp:383] Successfully fetched the 
> registry (0B) in 0ns
> I0925 23:58:13.547092  9453 registrar.cpp:487] Applied 1 operations in 
> 7135ns; attempting to update the registry
> I0925 23:58:13.547225  9453 registrar.cpp:544] Successfully updated the 
> registry in 0ns
> I0925 23:58:13.547250  9453 registrar.cpp:416] Successfully recovered 
> registrar
> I0925 23:58:13.547319  9453 master.cpp:1752] Recovered 0 agents from the 
> registry (172B); allowing 10mins for agents to reregister
> I0925 23:58:13.547336  9457 hierarchical.cpp:220] Skipping recovery of 
> hierarchical allocator: nothing to recover
> W0925 23:58:13.549054 21740 process.cpp:2810] Attempted to spawn already 
> running process files@172.16.10.34:35358
> I0925 23:58:13.549363 21740 containerizer.cpp:305] Using isolation 

[jira] [Commented] (MESOS-9262) ProvisionerDockerBackendTest.ROOT_INTERNET_CURL_DTYPE_Whiteout is flaky

2018-09-26 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628547#comment-16628547
 ] 

Alexander Rukletsov commented on MESOS-9262:


{noformat}
E0925 23:59:40.077899  3539 slave.cpp:6162] Container 
'c09a3eb9-7d46-4ff0-8b70-ec87d7adf2e2' for executor 
'c32a603e-7202-4534-a218-3116d8d5bb34' of framework 
34b4dabd-2b7c-4ba6-bccf-4dfa968087a1- failed to start: Collect failed: 
Failed to perform 'curl': curl: (52) Empty reply from server
{noformat}
I wonder whether this is related to the recent flavour of MESOS-7425, and why we 
use unusual images like {{zhq527725/whiteout}}.

> ProvisionerDockerBackendTest.ROOT_INTERNET_CURL_DTYPE_Whiteout is flaky
> ---
>
> Key: MESOS-9262
> URL: https://issues.apache.org/jira/browse/MESOS-9262
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: flaky-test
>
> Observed in an internal CI run (4489):
> {noformat}
> ../../src/tests/containerizer/provisioner_docker_tests.cpp:915
>   Expected: TASK_STARTING
> To be equal to: statusStarting->state()
>   Which is: TASK_FAILED
> {noformat}
> Full log:
> {noformat}
> [ RUN  ] 
> BackendFlag/ProvisionerDockerBackendTest.ROOT_INTERNET_CURL_DTYPE_Whiteout/0
> I0925 23:59:24.750632 21740 cluster.cpp:173] Creating default 'local' 
> authorizer
> I0925 23:59:24.752059  3540 master.cpp:413] Master 
> 34b4dabd-2b7c-4ba6-bccf-4dfa968087a1 (ip-172-16-10-34.ec2.internal) started 
> on 172.16.10.34:41596
> I0925 23:59:24.752087  3540 master.cpp:416] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="hierarchical" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/f5XfyH/credentials" --filter_gpu_resources="true" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
> --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/f5XfyH/master" --zk_session_timeout="10secs"
> I0925 23:59:24.752307  3540 master.cpp:465] Master only allowing 
> authenticated frameworks to register
> I0925 23:59:24.752393  3540 master.cpp:471] Master only allowing 
> authenticated agents to register
> I0925 23:59:24.752409  3540 master.cpp:477] Master only allowing 
> authenticated HTTP frameworks to register
> I0925 23:59:24.752418  3540 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/f5XfyH/credentials'
> I0925 23:59:24.752590  3540 master.cpp:521] Using default 'crammd5' 
> authenticator
> I0925 23:59:24.752715  3540 http.cpp:1037] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0925 23:59:24.752769  3540 http.cpp:1037] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0925 23:59:24.752804  3540 http.cpp:1037] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0925 23:59:24.752835  3540 master.cpp:602] Authorization enabled
> I0925 23:59:24.753206  3539 whitelist_watcher.cpp:77] No whitelist given
> I0925 23:59:24.753266  3544 hierarchical.cpp:182] Initialized hierarchical 
> allocator process
> I0925 23:59:24.753803  3540 master.cpp:2083] Elected as the leading master!
> I0925 23:59:24.753823  3540 master.cpp:1638] Recovering from registrar
> I0925 23:59:24.753863  3540 registrar.cpp:339] Recovering registrar
> I0925 23:59:24.754007  3540 registrar.cpp:383] Successfully fetched the 
> registry (0B) in 130048ns
> I0925 23:59:24.754041  3540 registrar.cpp:487] Applied 1 operations in 
> 8734ns; attempting to update the registry
> I0925 23:59:24.754166  3540 registrar.cpp:544] Successfully updated the 
> registry in 108032ns
> I0925 23:59:24.754195  3540 registrar.cpp:416] 

[jira] [Commented] (MESOS-9264) NestedContainerCniTest.ROOT_INTERNET_CURL_VerifyContainerHostname is flaky

2018-09-26 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628542#comment-16628542
 ] 

Alexander Rukletsov commented on MESOS-9264:


Apparently, {{library/alpine}} could not be fetched within 15 seconds?

> NestedContainerCniTest.ROOT_INTERNET_CURL_VerifyContainerHostname is flaky
> --
>
> Key: MESOS-9264
> URL: https://issues.apache.org/jira/browse/MESOS-9264
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: flaky-test
>
> Observed in an internal CI run: (4488)
> {noformat}
> ../../src/tests/containerizer/cni_isolator_tests.cpp:1969
> Failed to wait 15secs for updateRunning
> {noformat}
> Full log:
> {noformat}
> [ RUN  ] 
> JoinParentsNetworkParam/NestedContainerCniTest.ROOT_INTERNET_CURL_VerifyContainerHostname/0
> I0925 22:02:08.400498 11809 cluster.cpp:173] Creating default 'local' 
> authorizer
> I0925 22:02:08.401520 30157 master.cpp:413] Master 
> d800b4fe-ffe8-4a9c-b6cb-93f9ce4d0c8c (ip-172-16-10-238.ec2.internal) started 
> on 172.16.10.238:41592
> I0925 22:02:08.401608 30157 master.cpp:416] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="hierarchical" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/p8aET3/credentials" --filter_gpu_resources="true" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
> --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/p8aET3/master" --zk_session_timeout="10secs"
> I0925 22:02:08.401738 30157 master.cpp:465] Master only allowing 
> authenticated frameworks to register
> I0925 22:02:08.401749 30157 master.cpp:471] Master only allowing 
> authenticated agents to register
> I0925 22:02:08.401756 30157 master.cpp:477] Master only allowing 
> authenticated HTTP frameworks to register
> I0925 22:02:08.401762 30157 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/p8aET3/credentials'
> I0925 22:02:08.401834 30157 master.cpp:521] Using default 'crammd5' 
> authenticator
> I0925 22:02:08.401882 30157 http.cpp:1037] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0925 22:02:08.401932 30157 http.cpp:1037] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0925 22:02:08.401965 30157 http.cpp:1037] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0925 22:02:08.401998 30157 master.cpp:602] Authorization enabled
> I0925 22:02:08.402230 30163 hierarchical.cpp:182] Initialized hierarchical 
> allocator process
> I0925 22:02:08.402434 30163 whitelist_watcher.cpp:77] No whitelist given
> I0925 22:02:08.402696 30157 master.cpp:2083] Elected as the leading master!
> I0925 22:02:08.402716 30157 master.cpp:1638] Recovering from registrar
> I0925 22:02:08.402823 30157 registrar.cpp:339] Recovering registrar
> I0925 22:02:08.403005 30157 registrar.cpp:383] Successfully fetched the 
> registry (0B) in 158208ns
> I0925 22:02:08.403045 30157 registrar.cpp:487] Applied 1 operations in 
> 8612ns; attempting to update the registry
> I0925 22:02:08.403218 30156 registrar.cpp:544] Successfully updated the 
> registry in 128768ns
> I0925 22:02:08.403431 30156 registrar.cpp:416] Successfully recovered 
> registrar
> I0925 22:02:08.403694 30157 hierarchical.cpp:220] Skipping recovery of 
> hierarchical allocator: nothing to recover
> I0925 22:02:08.403750 30161 master.cpp:1752] Recovered 0 agents from the 
> registry (176B); allowing 10mins for agents to reregister
> W0925 22:02:08.405280 11809 process.cpp:2810] Attempted to spawn already 
> running process files@172.16.10.238:41592
> I0925 22:02:08.405745 

[jira] [Comment Edited] (MESOS-8096) Enqueueing events in MockHTTPScheduler can lead to segfaults.

2018-09-25 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604170#comment-16604170
 ] 

Alexander Rukletsov edited comment on MESOS-8096 at 9/25/18 12:23 PM:
--

This might be related to the following issue reported by {{clang-analyzer}}, courtesy of [~mcypark]:
{noformat}
src/scheduler/scheduler.cpp:911:5: warning: Call to virtual function during 
destruction will not dispatch to derived class 
[clang-analyzer-optin.cplusplus.VirtualCall]
stop();
^
{noformat}
Likely a hypothetical control flow starting from 
{{src/tests/http_fault_tolerance_tests.cpp:872}}
{noformat}
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1272:5:
 warning: Use of memory after it is freed [clang-analyzer-cplusplus.NewDelete]
return function_mocker_->AddNewExpectation(
^
/tmp/SRC/src/tests/http_fault_tolerance_tests.cpp:872:3: note: Calling 
'MockSpec::InternalExpectedAt'
  EXPECT_CALL(*scheduler, connected(_))
  ^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1845:32:
 note: expanded from macro 'EXPECT_CALL'
#define EXPECT_CALL(obj, call) GMOCK_EXPECT_CALL_IMPL_(obj, call)
   ^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1844:5:
 note: expanded from macro 'GMOCK_EXPECT_CALL_IMPL_'
((obj).gmock_##call).InternalExpectedAt(__FILE__, __LINE__, #obj, #call)
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1272:12:
 note: Calling 'FunctionMockerBase::AddNewExpectation'
return function_mocker_->AddNewExpectation(
   ^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1609:9:
 note: Memory is allocated
new TypedExpectation(this, file, line, source_text, m);
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1615:9:
 note: Assuming 'implicit_sequence' is equal to NULL
if (implicit_sequence != NULL) {
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1615:5:
 note: Taking false branch
if (implicit_sequence != NULL) {
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1619:13:
 note: Calling '~linked_ptr'
return *expectation;
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googletest/include/gtest/internal/gtest-linked_ptr.h:153:19:
 note: Calling 'linked_ptr::depart'
  ~linked_ptr() { depart(); }
  ^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googletest/include/gtest/internal/gtest-linked_ptr.h:205:5:
 note: Taking true branch
if (link_.depart()) delete value_;
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googletest/include/gtest/internal/gtest-linked_ptr.h:205:25:
 note: Memory is released
if (link_.depart()) delete value_;
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googletest/include/gtest/internal/gtest-linked_ptr.h:153:19:
 note: Returning; memory was released
  ~linked_ptr() { depart(); }
  ^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1619:13:
 note: Returning from '~linked_ptr'
return *expectation;
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1272:12:
 note: Returning; memory was released
return function_mocker_->AddNewExpectation(
   ^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1272:5:
 note: Use of memory after it is freed
return function_mocker_->AddNewExpectation(
^
{noformat}
There is seemingly equivalent output for the following places:
{noformat}
/tmp/SRC/src/tests/uri_fetcher_tests.cpp:140:3: note: Calling 
'MockSpec::InternalExpectedAt'
  EXPECT_CALL(server, test(_))
  ^
{noformat}
{noformat}
/tmp/SRC/src/tests/default_executor_tests.cpp:2042:3: note: Calling 
'MockSpec::InternalExpectedAt'
  EXPECT_CALL(*scheduler, connected(_))
  ^
{noformat}
{noformat}
/tmp/SRC/src/tests/scheduler_tests.cpp:2037:3: note: Calling 
'MockSpec::InternalExpectedAt'
  EXPECT_CALL(*scheduler, connected(_))
  ^
{noformat}
{noformat}
/tmp/SRC/src/tests/fetcher_tests.cpp:535:3: note: Calling 
'MockSpec::InternalExpectedAt'
  EXPECT_CALL(*http.process, test(_))
  ^
{noformat}
Of all the {{EXPECT_CALL}} s in the codebase, these are the only instances that 
are flagged. It is still unclear whether there is an issue here, but it seems 
worth checking out, especially since these files are known to be flaky.


was (Author: alexr):
Might be related to this issue, from {{clang-analyzer}}, 

[jira] [Assigned] (MESOS-1719) Master should persist framework information

2018-09-24 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-1719:
--

Assignee: (was: Yongqiao Wang)

> Master should persist framework information
> ---
>
> Key: MESOS-1719
> URL: https://issues.apache.org/jira/browse/MESOS-1719
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Vinod Kone
>Priority: Major
>  Labels: mesosphere, reliability
>
> https://issues.apache.org/jira/browse/MESOS-1219 disallows completed 
> frameworks from re-registering with the same framework id, as long as the 
> master doesn't failover.
> This ticket tracks the work to make this hold across master failover, using 
> the registrar.
> There are some open questions that need to be addressed:
> --> Should the registry contain only framework ids, or framework infos as well?
> For disallowing completed frameworks from re-registering, persisting 
> framework ids is enough. But if, in the future, we want to disallow
> frameworks from re-registering when some parts of the framework info
> have changed, then we need to persist the info too.
> --> How to update the framework info.
>   Currently frameworks are allowed to update the framework info while
>   re-registering, but the update only takes effect on the master when the
>   master fails over, and on the slave when the slave fails over. How should
>   things change when we persist the framework info?





[jira] [Comment Edited] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.

2018-09-21 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619417#comment-16619417
 ] 

Alexander Rukletsov edited comment on MESOS-8545 at 9/21/18 1:01 PM:
-

*{{master}} aka {{1.8-dev}}*:
{noformat}
commit 5b95bb0f21852058d22703385f2c8e139881bf1a
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:10:14 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:10:14 2018 +0200

Fixed HTTP errors caused by dropped HTTP responses by IOSwitchboard.

Previously, IOSwitchboard process could terminate before all HTTP
responses had been sent to the agent. In the case of
`ATTACH_CONTAINER_INPUT` call, we could drop a final HTTP `200 OK`
response, so the agent got broken HTTP connection for the call.
This patch introduces an acknowledgment for the received response
for the `ATTACH_CONTAINER_INPUT` call. This acknowledgment is a new
type of control messages for the `ATTACH_CONTAINER_INPUT` call. When
IOSwitchboard receives an acknowledgment, and io redirects are
finished, it terminates itself. That guarantees that the agent always
receives a response for the `ATTACH_CONTAINER_INPUT` call.

Review: https://reviews.apache.org/r/65168/
{noformat}
{noformat}
commit bfa2bd24780b5c49467b3c23260855e3d8b4c948
Author: Andrei Budnik 
AuthorDate: Fri Sep 21 14:51:24 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Fri Sep 21 14:51:24 2018 +0200

Fixed disconnection while sending acknowledgment to IOSwitchboard.

Previously, an HTTP connection to the IOSwitchboard could be garbage
collected before the agent sent an acknowledgment to the IOSwitchboard
via this connection. This patch fixes the issue by keeping a reference
count to the connection in a lambda callback until disconnection
occurs.

Review: https://reviews.apache.org/r/68768/
{noformat}
{noformat}
commit c3c77cbef818d497d8bd5e67fa72e55a7190e27a
Author: Andrei Budnik 
AuthorDate: Fri Sep 21 14:51:59 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Fri Sep 21 14:51:59 2018 +0200

Fixed broken pipe error in IOSwitchboard.

Previous attempt to fix `HTTP 500` "broken pipe" in review /r/62187/
was not correct: after IOSwitchboard sends a response to the agent for
the `ATTACH_CONTAINER_INPUT` call, the socket is closed immediately,
thus causing the error on the agent. This patch adds a delay after
IO redirects are finished and before IOSwitchboard forcibly send a
response.

Review: https://reviews.apache.org/r/68784/
{noformat}
*{{1.7.1}}*:
{noformat}
commit 1672941630960cccf66ed81b11811d84e8a4e3f0
commit 600b388e25c49f4fac4d39bc07bcf6ffce42c679
commit 021a8f4de1ad65167946548e3ecfa74d8e41e9c5
commit 38a914398b6f1aaf08db4f62f4e42cdb80127eb5
{noformat}
*{{1.6.2}}*:
{noformat}
commit 2ddd6f07bebbe91e1e0d5165c4a5ae552b836303
commit c1448f36d4c2c2c8345e7e8d1bf1f206dba18dac
commit 55b0e94f0c8a1896ca079361d89527123faf22c6
commit c40c92b7710b5b238b13ce6f1bacd3d75e04283b
{noformat}
*{{1.5.2}}*:
{noformat}
commit 3bf4fe22e0ed828a36d5b2ea652d07c6eef4b578
commit 33a6bec95b44592d626874ae8deaa3e2a3bbc120
commit 7b8195680104c2c5f61073a956f60ac961c37f45
commit 0216002744517a6053fd782b6b4dc3d6cf77dd5e
{noformat}


was (Author: alexr):
*{{master}} aka {{1.8-dev}}*:
{noformat}
commit 5b95bb0f21852058d22703385f2c8e139881bf1a
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:10:14 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:10:14 2018 +0200

Fixed HTTP errors caused by dropped HTTP responses by IOSwitchboard.

Previously, IOSwitchboard process could terminate before all HTTP
responses had been sent to the agent. In the case of
`ATTACH_CONTAINER_INPUT` call, we could drop a final HTTP `200 OK`
response, so the agent got broken 

[jira] [Comment Edited] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.

2018-09-21 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619417#comment-16619417
 ] 

Alexander Rukletsov edited comment on MESOS-8545 at 9/21/18 12:56 PM:
--

*{{master}} aka {{1.8-dev}}*:
{noformat}
commit 5b95bb0f21852058d22703385f2c8e139881bf1a
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:10:14 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:10:14 2018 +0200

Fixed HTTP errors caused by dropped HTTP responses by IOSwitchboard.

Previously, IOSwitchboard process could terminate before all HTTP
responses had been sent to the agent. In the case of
`ATTACH_CONTAINER_INPUT` call, we could drop a final HTTP `200 OK`
response, so the agent got broken HTTP connection for the call.
This patch introduces an acknowledgment for the received response
for the `ATTACH_CONTAINER_INPUT` call. This acknowledgment is a new
type of control messages for the `ATTACH_CONTAINER_INPUT` call. When
IOSwitchboard receives an acknowledgment, and io redirects are
finished, it terminates itself. That guarantees that the agent always
receives a response for the `ATTACH_CONTAINER_INPUT` call.

Review: https://reviews.apache.org/r/65168/
{noformat}
{noformat}
commit bfa2bd24780b5c49467b3c23260855e3d8b4c948
Author: Andrei Budnik 
AuthorDate: Fri Sep 21 14:51:24 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Fri Sep 21 14:51:24 2018 +0200

Fixed disconnection while sending acknowledgment to IOSwitchboard.

Previously, an HTTP connection to the IOSwitchboard could be garbage
collected before the agent sent an acknowledgment to the IOSwitchboard
via this connection. This patch fixes the issue by keeping a reference
count to the connection in a lambda callback until disconnection
occurs.

Review: https://reviews.apache.org/r/68768/
{noformat}
{noformat}
commit c3c77cbef818d497d8bd5e67fa72e55a7190e27a
Author: Andrei Budnik 
AuthorDate: Fri Sep 21 14:51:59 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Fri Sep 21 14:51:59 2018 +0200

Fixed broken pipe error in IOSwitchboard.

Previous attempt to fix `HTTP 500` "broken pipe" in review /r/62187/
was not correct: after IOSwitchboard sends a response to the agent for
the `ATTACH_CONTAINER_INPUT` call, the socket is closed immediately,
thus causing the error on the agent. This patch adds a delay after
IO redirects are finished and before IOSwitchboard forcibly send a
response.

Review: https://reviews.apache.org/r/68784/
{noformat}
*{{1.7.1}}*:
{noformat}
commit 1672941630960cccf66ed81b11811d84e8a4e3f0
commit 600b388e25c49f4fac4d39bc07bcf6ffce42c679
{noformat}
*{{1.6.2}}*:
{noformat}
commit 2ddd6f07bebbe91e1e0d5165c4a5ae552b836303
commit c1448f36d4c2c2c8345e7e8d1bf1f206dba18dac
{noformat}
*{{1.5.2}}*:
{noformat}
commit 3bf4fe22e0ed828a36d5b2ea652d07c6eef4b578
commit 33a6bec95b44592d626874ae8deaa3e2a3bbc120
{noformat}



[jira] [Comment Edited] (MESOS-9131) Health checks launching nested containers while a container is being destroyed lead to unkillable tasks.

2018-09-18 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619415#comment-16619415
 ] 

Alexander Rukletsov edited comment on MESOS-9131 at 9/18/18 6:14 PM:
-

*{{master}} aka {{1.8-dev}}*:
{noformat}
commit 2fdc8f3cffc5eac91e5f2b0c6aef2254acfc2bd0
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:09:31 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:09:31 2018 +0200

Fixed IOSwitchboard waiting EOF from attach container input request.

Previously, when a corresponding nested container terminated while the
user was attached to the container's stdin via `ATTACH_CONTAINER_INPUT`,
IOSwitchboard didn't terminate immediately. IOSwitchboard was waiting
for an EOF message from the input HTTP connection. Since the IOSwitchboard
was stuck, the corresponding nested container was also stuck in
`DESTROYING` state.

This patch fixes the aforementioned issue by sending 200 `OK` response
for `ATTACH_CONTAINER_INPUT` call in the case when io redirect is
finished while reading from the HTTP input connection is not.

Review: https://reviews.apache.org/r/68232/
{noformat}
{noformat}
commit e941d206f651bde861675a6517a89e44d1f61a34
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:10:01 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:10:01 2018 +0200

Added `AgentAPITest.LaunchNestedContainerSessionKillTask` test.

This test verifies that IOSwitchboard, which holds an open HTTP input
connection, terminates once IO redirects finish for the corresponding
nested container.

Review: https://reviews.apache.org/r/68230/
{noformat}
{noformat}
commit 7ad390b3aa261f4a39ff7f2c0842f2aae39005f4
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:10:07 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:10:07 2018 +0200

Added `AgentAPITest.AttachContainerInputRepeat` test.

This test verifies that we can call `ATTACH_CONTAINER_INPUT` more
than once. We send a short message first then we send a long message
in chunks.

Review: https://reviews.apache.org/r/68231/
{noformat}
*{{1.7.1}}*:
{noformat}
commit e9605a6243db41c1bbc85ec9ade112f2ef806c15
commit f672afef601c71d69a9eb4db3c191bacfe167d3e
commit 4a1b3186a2fa64bf7d94787f3546dd584e2f1186
{noformat}
*{{1.6.2}}*:
{noformat}
commit e3a9eb3b473a10f210913d568c1d9923ed05d933
commit a1798ae1fb2249280f4a4e9fec69eb9e37b95452
commit d82177d00a4a25d70aab172a91c855ad6b07f768
{noformat}
*{{1.5.2}}*:
{noformat}
commit 5a5089938f13a5aafc0a4ee3308f33e76374c408
commit 25de60746de4681ed0d858cba0790372f03ff840
commit fa6eb85fd2a8798842855628495c16664bc68652
{noformat}



[jira] [Comment Edited] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.

2018-09-18 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619417#comment-16619417
 ] 

Alexander Rukletsov edited comment on MESOS-8545 at 9/18/18 6:14 PM:
-

*{{master}} aka {{1.8-dev}}*:
{noformat}
commit 5b95bb0f21852058d22703385f2c8e139881bf1a
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:10:14 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:10:14 2018 +0200

Fixed HTTP errors caused by dropped HTTP responses by IOSwitchboard.

Previously, IOSwitchboard process could terminate before all HTTP
responses had been sent to the agent. In the case of
`ATTACH_CONTAINER_INPUT` call, we could drop a final HTTP `200 OK`
response, so the agent got broken HTTP connection for the call.
This patch introduces an acknowledgment for the received response
for the `ATTACH_CONTAINER_INPUT` call. This acknowledgment is a new
type of control messages for the `ATTACH_CONTAINER_INPUT` call. When
IOSwitchboard receives an acknowledgment, and io redirects are
finished, it terminates itself. That guarantees that the agent always
receives a response for the `ATTACH_CONTAINER_INPUT` call.

Review: https://reviews.apache.org/r/65168/
{noformat}
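The acknowledgment protocol described in this commit reduces to a simple termination condition: the switchboard may shut down only once both the client's acknowledgment has arrived and all io redirects have drained. A hedged sketch of that condition, with hypothetical names (this is not the actual IOSwitchboard implementation):

```cpp
// Hypothetical model of the termination condition: terminate only when
// BOTH the acknowledgment has been received AND io redirects finished.
struct SwitchboardState {
  bool ack_received = false;
  bool redirects_finished = false;
  bool terminated = false;

  void on_ack()            { ack_received = true;       maybe_terminate(); }
  void on_redirects_done() { redirects_finished = true; maybe_terminate(); }

 private:
  void maybe_terminate() {
    if (ack_received && redirects_finished) {
      // Safe to exit: the final response has provably been delivered.
      terminated = true;
    }
  }
};
```

Either event alone is insufficient; this is what guarantees the agent always receives a response before the process exits.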
*{{1.7.1}}*:
{noformat}
commit 1672941630960cccf66ed81b11811d84e8a4e3f0
commit 600b388e25c49f4fac4d39bc07bcf6ffce42c679
{noformat}
*{{1.6.2}}*:
{noformat}
commit 2ddd6f07bebbe91e1e0d5165c4a5ae552b836303
commit c1448f36d4c2c2c8345e7e8d1bf1f206dba18dac
{noformat}
*{{1.5.2}}*:
{noformat}
commit 3bf4fe22e0ed828a36d5b2ea652d07c6eef4b578
commit 33a6bec95b44592d626874ae8deaa3e2a3bbc120
{noformat}



[jira] [Comment Edited] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.

2018-09-18 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619417#comment-16619417
 ] 

Alexander Rukletsov edited comment on MESOS-8545 at 9/18/18 5:58 PM:
-

*{{master}} aka {{1.8-dev}}*:
{noformat}
commit 5b95bb0f21852058d22703385f2c8e139881bf1a
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:10:14 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:10:14 2018 +0200

Fixed HTTP errors caused by dropped HTTP responses by IOSwitchboard.

Previously, IOSwitchboard process could terminate before all HTTP
responses had been sent to the agent. In the case of
`ATTACH_CONTAINER_INPUT` call, we could drop a final HTTP `200 OK`
response, so the agent got broken HTTP connection for the call.
This patch introduces an acknowledgment for the received response
for the `ATTACH_CONTAINER_INPUT` call. This acknowledgment is a new
type of control messages for the `ATTACH_CONTAINER_INPUT` call. When
IOSwitchboard receives an acknowledgment, and io redirects are
finished, it terminates itself. That guarantees that the agent always
receives a response for the `ATTACH_CONTAINER_INPUT` call.

Review: https://reviews.apache.org/r/65168/
{noformat}
*{{1.7.1}}*:
{noformat}
commit 1672941630960cccf66ed81b11811d84e8a4e3f0
commit 600b388e25c49f4fac4d39bc07bcf6ffce42c679
{noformat}
*{{1.6.2}}*:
{noformat}
commit 2ddd6f07bebbe91e1e0d5165c4a5ae552b836303
commit c1448f36d4c2c2c8345e7e8d1bf1f206dba18dac
{noformat}



> AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.
> 

[jira] [Comment Edited] (MESOS-9131) Health checks launching nested containers while a container is being destroyed lead to unkillable tasks.

2018-09-18 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619415#comment-16619415
 ] 

Alexander Rukletsov edited comment on MESOS-9131 at 9/18/18 5:57 PM:
-

*{{master}} aka {{1.8-dev}}*:
{noformat}
commit 2fdc8f3cffc5eac91e5f2b0c6aef2254acfc2bd0
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:09:31 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:09:31 2018 +0200

Fixed IOSwitchboard waiting EOF from attach container input request.

Previously, when a corresponding nested container terminated while the
user was attached to the container's stdin via `ATTACH_CONTAINER_INPUT`,
IOSwitchboard didn't terminate immediately. IOSwitchboard was waiting
for an EOF message from the input HTTP connection. Since the IOSwitchboard
was stuck, the corresponding nested container was also stuck in
`DESTROYING` state.

This patch fixes the aforementioned issue by sending 200 `OK` response
for `ATTACH_CONTAINER_INPUT` call in the case when io redirect is
finished while reading from the HTTP input connection is not.

Review: https://reviews.apache.org/r/68232/
{noformat}
{noformat}
commit e941d206f651bde861675a6517a89e44d1f61a34
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:10:01 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:10:01 2018 +0200

Added `AgentAPITest.LaunchNestedContainerSessionKillTask` test.

This test verifies that IOSwitchboard, which holds an open HTTP input
connection, terminates once IO redirects finish for the corresponding
nested container.

Review: https://reviews.apache.org/r/68230/
{noformat}
{noformat}
commit 7ad390b3aa261f4a39ff7f2c0842f2aae39005f4
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:10:07 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:10:07 2018 +0200

Added `AgentAPITest.AttachContainerInputRepeat` test.

This test verifies that we can call `ATTACH_CONTAINER_INPUT` more
than once. We send a short message first then we send a long message
in chunks.

Review: https://reviews.apache.org/r/68231/
{noformat}
*{{1.7.1}}*:
{noformat}
commit e9605a6243db41c1bbc85ec9ade112f2ef806c15
commit f672afef601c71d69a9eb4db3c191bacfe167d3e
commit 4a1b3186a2fa64bf7d94787f3546dd584e2f1186
{noformat}
*{{1.6.2}}*:
{noformat}
commit e3a9eb3b473a10f210913d568c1d9923ed05d933
commit a1798ae1fb2249280f4a4e9fec69eb9e37b95452
commit d82177d00a4a25d70aab172a91c855ad6b07f768
{noformat}



> Health checks launching nested containers while a container is being 
> destroyed lead to unkillable tasks.
> 

[jira] [Comment Edited] (MESOS-9131) Health checks launching nested containers while a container is being destroyed lead to unkillable tasks

2018-09-18 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619415#comment-16619415
 ] 

Alexander Rukletsov edited comment on MESOS-9131 at 9/18/18 5:44 PM:
-

*{{master}} aka {{1.8-dev}}*:
{noformat}
commit 2fdc8f3cffc5eac91e5f2b0c6aef2254acfc2bd0
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:09:31 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:09:31 2018 +0200

Fixed IOSwitchboard waiting EOF from attach container input request.

Previously, when a corresponding nested container terminated while the
user was attached to the container's stdin via `ATTACH_CONTAINER_INPUT`,
IOSwitchboard didn't terminate immediately. IOSwitchboard was waiting
for an EOF message from the input HTTP connection. Since the IOSwitchboard
was stuck, the corresponding nested container was also stuck in
`DESTROYING` state.

This patch fixes the aforementioned issue by sending 200 `OK` response
for `ATTACH_CONTAINER_INPUT` call in the case when io redirect is
finished while reading from the HTTP input connection is not.

Review: https://reviews.apache.org/r/68232/
{noformat}
{noformat}
commit e941d206f651bde861675a6517a89e44d1f61a34
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:10:01 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:10:01 2018 +0200

Added `AgentAPITest.LaunchNestedContainerSessionKillTask` test.

This test verifies that IOSwitchboard, which holds an open HTTP input
connection, terminates once IO redirects finish for the corresponding
nested container.

Review: https://reviews.apache.org/r/68230/
{noformat}
{noformat}
commit 7ad390b3aa261f4a39ff7f2c0842f2aae39005f4
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:10:07 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:10:07 2018 +0200

Added `AgentAPITest.AttachContainerInputRepeat` test.

This test verifies that we can call `ATTACH_CONTAINER_INPUT` more
than once. We send a short message first then we send a long message
in chunks.

Review: https://reviews.apache.org/r/68231/
{noformat}
*{{1.7.1}}*:
{noformat}
commit e9605a6243db41c1bbc85ec9ade112f2ef806c15
commit f672afef601c71d69a9eb4db3c191bacfe167d3e
commit 4a1b3186a2fa64bf7d94787f3546dd584e2f1186
{noformat}



[jira] [Comment Edited] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.

2018-09-18 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619417#comment-16619417
 ] 

Alexander Rukletsov edited comment on MESOS-8545 at 9/18/18 5:44 PM:
-

*{{master}} aka {{1.8-dev}}*:
{noformat}
commit 5b95bb0f21852058d22703385f2c8e139881bf1a
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:10:14 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:10:14 2018 +0200

Fixed HTTP errors caused by dropped HTTP responses by IOSwitchboard.

Previously, IOSwitchboard process could terminate before all HTTP
responses had been sent to the agent. In the case of
`ATTACH_CONTAINER_INPUT` call, we could drop a final HTTP `200 OK`
response, so the agent got broken HTTP connection for the call.
This patch introduces an acknowledgment for the received response
for the `ATTACH_CONTAINER_INPUT` call. This acknowledgment is a new
type of control messages for the `ATTACH_CONTAINER_INPUT` call. When
IOSwitchboard receives an acknowledgment, and io redirects are
finished, it terminates itself. That guarantees that the agent always
receives a response for the `ATTACH_CONTAINER_INPUT` call.

Review: https://reviews.apache.org/r/65168/
{noformat}
*{{1.7.1}}*:
{noformat}
commit 1672941630960cccf66ed81b11811d84e8a4e3f0
commit 600b388e25c49f4fac4d39bc07bcf6ffce42c679
{noformat}



[jira] [Comment Edited] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.

2018-09-18 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619417#comment-16619417
 ] 

Alexander Rukletsov edited comment on MESOS-8545 at 9/18/18 5:43 PM:
-

*{{master}} aka {{1.8-dev}}*:
{noformat}
commit 5b95bb0f21852058d22703385f2c8e139881bf1a
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:10:14 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:10:14 2018 +0200

Fixed HTTP errors caused by dropped HTTP responses by IOSwitchboard.

Previously, IOSwitchboard process could terminate before all HTTP
responses had been sent to the agent. In the case of
`ATTACH_CONTAINER_INPUT` call, we could drop a final HTTP `200 OK`
response, so the agent got broken HTTP connection for the call.
This patch introduces an acknowledgment for the received response
for the `ATTACH_CONTAINER_INPUT` call. This acknowledgment is a new
type of control messages for the `ATTACH_CONTAINER_INPUT` call. When
IOSwitchboard receives an acknowledgment, and io redirects are
finished, it terminates itself. That guarantees that the agent always
receives a response for the `ATTACH_CONTAINER_INPUT` call.

Review: https://reviews.apache.org/r/65168/
{noformat}
*{{1.7.1}}*:
{noformat}
commit 1672941630960cccf66ed81b11811d84e8a4e3f0
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:10:14 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:27:17 2018 +0200

Fixed HTTP errors caused by dropped HTTP responses by IOSwitchboard.

Previously, IOSwitchboard process could terminate before all HTTP
responses had been sent to the agent. In the case of
`ATTACH_CONTAINER_INPUT` call, we could drop a final HTTP `200 OK`
response, so the agent got broken HTTP connection for the call.
This patch introduces an acknowledgment for the received response
for the `ATTACH_CONTAINER_INPUT` call. This acknowledgment is a new
type of control messages for the `ATTACH_CONTAINER_INPUT` call. When
IOSwitchboard receives an acknowledgment, and io redirects are
finished, it terminates itself. That guarantees that the agent always
receives a response for the `ATTACH_CONTAINER_INPUT` call.

Review: https://reviews.apache.org/r/65168/
(cherry picked from commit 5b95bb0f21852058d22703385f2c8e139881bf1a)
{noformat}
{noformat}
commit 600b388e25c49f4fac4d39bc07bcf6ffce42c679
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:10:20 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:27:17 2018 +0200

Fixed broken pipe error in IOSwitchboard.

We force IOSwitchboard to return a final response to the client for the
`ATTACH_CONTAINER_INPUT` call after IO redirects are finished. In this
case, we don't read the remaining messages from the input stream. So the
agent might send an acknowledgment for the request before IOSwitchboard
has received the remaining messages. We need to delay termination of
IOSwitchboard to give it a chance to read the remaining messages.
Otherwise, the agent might get an `HTTP 500` "broken pipe" error while
attempting to write the final message.

Review: https://reviews.apache.org/r/62187/
(cherry picked from commit c5cf4d49f47579b5a6cb7afc2f7df7c8f51dc6d0)
{noformat}
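The broken-pipe fix above boils down to draining the input connection before tearing it down: the switchboard keeps reading whatever messages are still buffered, so the agent's final write never hits a closed socket. A hedged sketch of that idea only, with invented names:

```python
from collections import deque

# Sketch of the drain-before-close behaviour described above: before
# terminating, the switchboard reads the messages still buffered on
# the input connection, so the peer's final write never fails with
# "broken pipe" (surfaced as HTTP 500). Names are illustrative.
def drain_remaining(inbox: deque) -> int:
    drained = 0
    while inbox:
        inbox.popleft()   # real code would read/discard the message
        drained += 1
    return drained        # only now may the connection be closed


inbox = deque(["chunk-1", "chunk-2"])
assert drain_remaining(inbox) == 2
assert not inbox              # safe to close the connection
```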


[jira] [Comment Edited] (MESOS-9131) Health checks launching nested containers while a container is being destroyed lead to unkillable tasks

2018-09-18 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619415#comment-16619415
 ] 

Alexander Rukletsov edited comment on MESOS-9131 at 9/18/18 5:42 PM:
-

*{{master}} aka {{1.8-dev}}*:
{noformat}
commit 2fdc8f3cffc5eac91e5f2b0c6aef2254acfc2bd0
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:09:31 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:09:31 2018 +0200

Fixed IOSwitchboard waiting for EOF from the attach container input request.

Previously, when a nested container terminated while the user was
attached to the container's stdin via `ATTACH_CONTAINER_INPUT`,
IOSwitchboard didn't terminate immediately: it kept waiting for an
EOF message from the input HTTP connection. Since IOSwitchboard was
stuck, the corresponding nested container was also stuck in the
`DESTROYING` state.

This patch fixes the aforementioned issue by sending a `200 OK`
response for the `ATTACH_CONTAINER_INPUT` call when the io redirect
is finished but reading from the HTTP input connection is not.

Review: https://reviews.apache.org/r/68232/
{noformat}
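The fix in the commit above changes when the `200 OK` is sent: redirect completion alone becomes sufficient, rather than also waiting for EOF on the input connection. A minimal sketch of that rule, with an invented function name:

```python
# Sketch of the early-response rule above: once the io redirect is
# finished, IOSwitchboard answers `ATTACH_CONTAINER_INPUT` with
# `200 OK` even if the input connection has not yet delivered EOF,
# instead of blocking until the client closes. Illustrative only.
def should_respond_ok(redirect_finished: bool, input_eof_seen: bool) -> bool:
    # Before the fix the response effectively also waited for
    # `input_eof_seen`; after the fix, redirect completion suffices.
    return redirect_finished


assert should_respond_ok(True, False)     # the fixed case: no EOF yet
assert should_respond_ok(True, True)
assert not should_respond_ok(False, False)  # redirect still running
```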
{noformat}
commit e941d206f651bde861675a6517a89e44d1f61a34
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:10:01 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:10:01 2018 +0200

Added `AgentAPITest.LaunchNestedContainerSessionKillTask` test.

This test verifies that IOSwitchboard, which holds an open HTTP input
connection, terminates once IO redirects finish for the corresponding
nested container.

Review: https://reviews.apache.org/r/68230/
{noformat}
{noformat}
commit 7ad390b3aa261f4a39ff7f2c0842f2aae39005f4
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:10:07 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:10:07 2018 +0200

Added `AgentAPITest.AttachContainerInputRepeat` test.

This test verifies that we can call `ATTACH_CONTAINER_INPUT` more
than once: we send a short message first, then a long message in
chunks.

Review: https://reviews.apache.org/r/68231/
{noformat}
*{{1.7.1}}*:
{noformat}
commit e9605a6243db41c1bbc85ec9ade112f2ef806c15
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:09:31 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:27:17 2018 +0200

Fixed IOSwitchboard waiting for EOF from the attach container input request.

Previously, when a nested container terminated while the user was
attached to the container's stdin via `ATTACH_CONTAINER_INPUT`,
IOSwitchboard didn't terminate immediately: it kept waiting for an
EOF message from the input HTTP connection. Since IOSwitchboard was
stuck, the corresponding nested container was also stuck in the
`DESTROYING` state.

This patch fixes the aforementioned issue by sending a `200 OK`
response for the `ATTACH_CONTAINER_INPUT` call when the io redirect
is finished but reading from the HTTP input connection is not.

Review: https://reviews.apache.org/r/68232/
(cherry picked from commit 2fdc8f3cffc5eac91e5f2b0c6aef2254acfc2bd0)
{noformat}
{noformat}
commit f672afef601c71d69a9eb4db3c191bacfe167d3e
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:10:01 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:27:17 2018 +0200

Added `AgentAPITest.LaunchNestedContainerSessionKillTask` test.

This test verifies that IOSwitchboard, which holds an open HTTP input
connection, terminates once IO redirects finish for the corresponding
nested container.

Review: https://reviews.apache.org/r/68230/
(cherry picked from commit e941d206f651bde861675a6517a89e44d1f61a34)
{noformat}
{noformat}
commit 4a1b3186a2fa64bf7d94787f3546dd584e2f1186
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:10:07 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:27:17 2018 +0200

Added `AgentAPITest.AttachContainerInputRepeat` test.

This test verifies that we can call `ATTACH_CONTAINER_INPUT` more
than once: we send a short message first, then a long message in
chunks.

Review: https://reviews.apache.org/r/68231/
(cherry picked from commit 7ad390b3aa261f4a39ff7f2c0842f2aae39005f4)
{noformat}



[jira] [Commented] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.

2018-09-18 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619417#comment-16619417
 ] 

Alexander Rukletsov commented on MESOS-8545:


*{{master}} aka {{1.8-dev}}*:
{noformat}
commit 5b95bb0f21852058d22703385f2c8e139881bf1a
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:10:14 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:10:14 2018 +0200

Fixed HTTP errors caused by HTTP responses dropped by IOSwitchboard.

Previously, the IOSwitchboard process could terminate before all HTTP
responses had been sent to the agent. In the case of the
`ATTACH_CONTAINER_INPUT` call, we could drop the final HTTP `200 OK`
response, so the agent got a broken HTTP connection for the call.
This patch introduces an acknowledgment for the received response
to the `ATTACH_CONTAINER_INPUT` call, implemented as a new type of
control message for that call. Once IOSwitchboard has received the
acknowledgment and the io redirects are finished, it terminates
itself. That guarantees that the agent always receives a response
for the `ATTACH_CONTAINER_INPUT` call.

Review: https://reviews.apache.org/r/65168/
{noformat}

> AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.
> ---
>
> Key: MESOS-8545
> URL: https://issues.apache.org/jira/browse/MESOS-8545
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.5.0, 1.6.1, 1.7.0
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: Mesosphere, flaky-test
> Attachments: 
> AgentAPIStreamingTest.AttachInputToNestedContainerSession-badrun.txt, 
> AgentAPIStreamingTest.AttachInputToNestedContainerSession-badrun2.txt
>
>
> {code:java}
> I0205 17:11:01.091872 4898 http_proxy.cpp:132] Returning '500 Internal Server 
> Error' for '/slave(974)/api/v1' (Disconnected)
> /home/centos/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-centos-7/mesos/src/tests/api_tests.cpp:6596:
>  Failure
> Value of: (response).get().status
> Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9131) Health checks launching nested containers while a container is being destroyed lead to unkillable tasks

2018-09-18 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619415#comment-16619415
 ] 

Alexander Rukletsov commented on MESOS-9131:


*{{master}} aka {{1.8-dev}}*:
{noformat}
commit 2fdc8f3cffc5eac91e5f2b0c6aef2254acfc2bd0
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:09:31 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:09:31 2018 +0200

Fixed IOSwitchboard waiting for EOF from the attach container input request.

Previously, when a nested container terminated while the user was
attached to the container's stdin via `ATTACH_CONTAINER_INPUT`,
IOSwitchboard didn't terminate immediately: it kept waiting for an
EOF message from the input HTTP connection. Since IOSwitchboard was
stuck, the corresponding nested container was also stuck in the
`DESTROYING` state.

This patch fixes the aforementioned issue by sending a `200 OK`
response for the `ATTACH_CONTAINER_INPUT` call when the io redirect
is finished but reading from the HTTP input connection is not.

Review: https://reviews.apache.org/r/68232/
{noformat}
{noformat}
commit e941d206f651bde861675a6517a89e44d1f61a34
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:10:01 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:10:01 2018 +0200

Added `AgentAPITest.LaunchNestedContainerSessionKillTask` test.

This test verifies that IOSwitchboard, which holds an open HTTP input
connection, terminates once IO redirects finish for the corresponding
nested container.

Review: https://reviews.apache.org/r/68230/
{noformat}
{noformat}
commit 7ad390b3aa261f4a39ff7f2c0842f2aae39005f4
Author: Andrei Budnik 
AuthorDate: Tue Sep 18 19:10:07 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Sep 18 19:10:07 2018 +0200

Added `AgentAPITest.AttachContainerInputRepeat` test.

This test verifies that we can call `ATTACH_CONTAINER_INPUT` more
than once: we send a short message first, then a long message in
chunks.

Review: https://reviews.apache.org/r/68231/
{noformat}

> Health checks launching nested containers while a container is being 
> destroyed lead to unkillable tasks
> ---
>
> Key: MESOS-9131
> URL: https://issues.apache.org/jira/browse/MESOS-9131
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Affects Versions: 1.5.1
>Reporter: Jan Schlicht
>Assignee: Andrei Budnik
>Priority: Blocker
>  Labels: container-stuck
> Fix For: 1.5.2, 1.6.2, 1.7.1, 1.8.0
>
>
> A container might get stuck in {{DESTROYING}} state if there's a command 
> health check that starts new nested containers while its parent container is 
> getting destroyed.
> Here are some logs with unrelated lines removed. The 
> `REMOVE_NESTED_CONTAINER`/`LAUNCH_NESTED_CONTAINER_SESSION` cycle keeps 
> looping afterwards.
> {noformat}
> 2018-04-16 12:37:54: I0416 12:37:54.235877  3863 containerizer.cpp:2807] 
> Container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 has 
> exited
> 2018-04-16 12:37:54: I0416 12:37:54.235914  3863 containerizer.cpp:2354] 
> Destroying container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 in 
> RUNNING state
> 2018-04-16 12:37:54: I0416 12:37:54.235932  3863 containerizer.cpp:2968] 
> Transitioning the state of container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 
> from RUNNING to DESTROYING
> 2018-04-16 12:37:54: I0416 12:37:54.236100  3852 linux_launcher.cpp:514] 
> Asked to destroy container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.237671  3852 linux_launcher.cpp:560] 
> Using freezer to destroy cgroup 
> mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.240327  3852 cgroups.cpp:3060] Freezing 
> cgroup 
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.244179  3852 cgroups.cpp:1415] 
> Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
>  after 3.814144ms
> 2018-04-16 12:37:54: I0416 12:37:54.250550  3853 cgroups.cpp:3078] Thawing 
> cgroup 
> 

[jira] [Commented] (MESOS-9241) Delimiters in endpoint names are inconsistent across mesos components.

2018-09-18 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619185#comment-16619185
 ] 

Alexander Rukletsov commented on MESOS-9241:


A brief search reveals that [there are arguments for 
both|https://stackoverflow.com/questions/10302179/hyphen-underscore-or-camelcase-as-word-delimiter-in-uris],
 however for REST and REST-like APIs underscore {{_}} seems to be the de 
facto standard:
https://api.stripe.com/v1/subscription_items
https://developer.twitter.com/en/docs/api-reference-index.html
https://www.graph.facebook.com///finance_permissions?user=_permission=

Hence the suggestion is to standardise on {{_}} in Mesos.
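To make the suggestion concrete, standardising on `_` means every hyphenated endpoint name maps to its underscore form. A trivial, purely illustrative sketch (Mesos would rename the routes themselves, not rewrite paths on the fly):

```python
# Sketch of the proposed standardisation: endpoint names use `_` as the
# word delimiter. Purely illustrative; Mesos would rename the routes
# rather than translate them at request time.
def normalize_endpoint(path: str) -> str:
    return path.replace("-", "_")


assert normalize_endpoint("/master/create-volumes") == "/master/create_volumes"
assert normalize_endpoint("/master/state-summary") == "/master/state_summary"
# Already-conforming endpoints are unchanged:
assert normalize_endpoint("/slave(1)/api/v1/resource_provider") == (
    "/slave(1)/api/v1/resource_provider")
```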

> Delimiters in endpoint names are inconsistent across mesos components.
> --
>
> Key: MESOS-9241
> URL: https://issues.apache.org/jira/browse/MESOS-9241
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Alexander Rukletsov
>Priority: Minor
>  Labels: api, tech-debt
>
> At the moment endpoints in Mesos components have both {{-}} and {{_}} as 
> delimiters:
> {noformat}
> /master/create-volumes
> /master/destroy-volumes
> /master/state-summary
> /slave(1)/api/v1/resource_provider
> {noformat}
> This is an inconsistency for no good reason.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9241) Delimiters in endpoint names are inconsistent across mesos components.

2018-09-18 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-9241:
--

 Summary: Delimiters in endpoint names are inconsistent across 
mesos components.
 Key: MESOS-9241
 URL: https://issues.apache.org/jira/browse/MESOS-9241
 Project: Mesos
  Issue Type: Improvement
  Components: HTTP API
Reporter: Alexander Rukletsov


At the moment endpoints in Mesos components have both {{-}} and {{_}} as 
delimiters:
{noformat}
/master/create-volumes
/master/destroy-volumes
/master/state-summary
/slave(1)/api/v1/resource_provider
{noformat}
This is an inconsistency for no good reason.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-7121) Make IO Switchboard optional for debug containers

2018-09-14 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-7121:
--

Shepherd: Alexander Rukletsov
Assignee: Andrei Budnik
  Sprint: Mesosphere Sprint 2018-29
Story Points: 5

> Make IO Switchboard optional for debug containers
> -
>
> Key: MESOS-7121
> URL: https://issues.apache.org/jira/browse/MESOS-7121
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Gastón Kleiman
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: debugging, health-check, mesosphere, performance
>
> Starting a new IO switchboard for each debug container adds some overhead.
> The functionality provided by the IO switchboard is not always necessary, so 
> we should make the IO switchboard optional in order to improve the 
> performance of launching nested containers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8975) Problem and solution overview for the slow API issue.

2018-09-14 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-8975:
--

Shepherd: Alexander Rukletsov
Assignee: Benno Evers  (was: Alexander Rukletsov)

> Problem and solution overview for the slow API issue.
> -
>
> Key: MESOS-8975
> URL: https://issues.apache.org/jira/browse/MESOS-8975
> Project: Mesos
>  Issue Type: Task
>  Components: HTTP API
>Reporter: Alexander Rukletsov
>Assignee: Benno Evers
>Priority: Major
>  Labels: performance
>
> Collect data from the clusters regarding {{state.json}} responsiveness, 
> figure out, where the bottlenecks are, and prepare an overview of solutions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9224) De-duplicate read-only requests to master based on principal.

2018-09-10 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-9224:
--

 Summary: De-duplicate read-only requests to master based on 
principal.
 Key: MESOS-9224
 URL: https://issues.apache.org/jira/browse/MESOS-9224
 Project: Mesos
  Issue Type: Improvement
  Components: HTTP API
Reporter: Alexander Rukletsov
Assignee: Benno Evers


"Identical" read-only requests can be batched and answered together. With 
batching available (MESOS-9158), we can now deduplicate requests based on 
principal.
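The idea above can be sketched by grouping one batch of read-only requests by principal, so each group shares a single computed response. The request shape and function name here are invented for illustration, not Mesos' actual types:

```python
from collections import defaultdict

# Sketch of the de-duplication idea above: read-only requests arriving
# in one batch that share a principal (and thus the same authorization
# filtering of the response) can be answered by computing the response
# once per group. The request representation is hypothetical.
def group_by_principal(batch):
    groups = defaultdict(list)
    for request in batch:
        groups[request["principal"]].append(request)
    return groups


batch = [
    {"principal": "alice", "endpoint": "/state"},
    {"principal": "bob", "endpoint": "/state"},
    {"principal": "alice", "endpoint": "/state"},
]
groups = group_by_principal(batch)
assert len(groups) == 2            # one computed response for alice, one for bob
assert len(groups["alice"]) == 2   # both alice requests share one answer
```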



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9189) Include 'Connection: close' header in master streaming API responses.

2018-09-10 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608986#comment-16608986
 ] 

Alexander Rukletsov commented on MESOS-9189:


This is still in {{master}}. Is that on purpose, [~gkleiman], [~bmahler]?

> Include 'Connection: close' header in master streaming API responses.
> -
>
> Key: MESOS-9189
> URL: https://issues.apache.org/jira/browse/MESOS-9189
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
> Attachments: bad_run.txt, good_run.txt
>
>
> We've seen some HTTP intermediaries (e.g. ELB) decide to re-use connections 
> to mesos as an optimization to avoid re-connection overhead. As a result, 
> when the end-client of the streaming API disconnects from the intermediary, 
> the intermediary leaves the connection to mesos open in an attempt to re-use 
> the connection for another request once the response completes. Mesos then 
> thinks that the subscriber never disconnected and the intermediary happily 
> continues to read the streaming events even though there's no end-client.
> To help indicate to intermediaries that the connection SHOULD NOT be re-used, 
> we can set the 'Connection: close' header for streaming API responses. It may 
> not be respected (since the language seems to be SHOULD NOT), but some 
> intermediaries may respect it and close the connection if the end-client 
> disconnects.
> Note that libprocess' http server currently doesn't close the connection 
> based on a handler setting this header, but it doesn't matter here since the 
> streaming API responses are infinite.
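The mitigation described above is just an extra response header on streaming API responses. A minimal stand-in sketch (a plain dict, not libprocess' actual response type):

```python
# Sketch of the mitigation described above: master streaming API
# responses carry `Connection: close` so HTTP intermediaries (e.g. ELB)
# are hinted not to re-use the connection once the end-client goes away.
# Minimal stand-in dict, not libprocess' response API.
def streaming_response_headers():
    return {
        "Content-Type": "application/json",
        "Transfer-Encoding": "chunked",
        # A hint only ("SHOULD NOT" re-use); intermediaries may ignore it.
        "Connection": "close",
    }


assert streaming_response_headers()["Connection"] == "close"
```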



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9194) Extend request batching to '/roles' endpoint

2018-09-06 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-9194:
--

 Assignee: Benno Evers
   Sprint: Mesosphere Sprint 2018-28
 Story Points: 3
   Labels: mesosphere  (was: )
Fix Version/s: 1.8.0

> Extend request batching to '/roles' endpoint
> 
>
> Key: MESOS-9194
> URL: https://issues.apache.org/jira/browse/MESOS-9194
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>  Labels: mesosphere
> Fix For: 1.8.0
>
>
> For consistency and improved performance under load, the `/roles` endpoint 
> should use the same request batching mechanism as `/state`, `/tasks`, ...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9116) Launch nested container session fails due to incorrect detection of `mnt` namespace of command executor's task.

2018-09-06 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16586014#comment-16586014
 ] 

Alexander Rukletsov edited comment on MESOS-9116 at 9/6/18 11:44 AM:
-

{noformat}
commit d95a16e03d27a2b6575148183e53a3b4507a16c1
Author: Andrei Budnik 
AuthorDate: Mon Aug 20 16:22:33 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Mon Aug 20 16:22:33 2018 +0200

Added `LaunchNestedContainerSessionInParallel` test.

This patch adds a test which verifies that launching multiple
short-lived nested container sessions succeeds. This test
implicitly verifies that the agent correctly detects the `mnt`
namespace of a command executor's task. If the detection fails, the
containerizer launcher (aka the `nanny` process) fails to enter the
`mnt` namespace, so it prints an error message to stderr for this
nested container.

This test is disabled until we fix MESOS-8545.

Review: https://reviews.apache.org/r/68256/
{noformat}
{noformat}
commit e78f636d84f2709da17275f7d70265520c0f4f94
Author: Andrei Budnik 
AuthorDate: Mon Aug 20 16:28:31 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Mon Aug 20 16:28:31 2018 +0200

Fixed incorrect `mnt` namespace detection of command executor's task.

Previously, we were walking the process tree from the container's
`init` process to find the first process along the way whose `mnt`
namespace differs from the `init` process. We expected this algorithm
to always return the PID of the command executor's task.

However, if someone launches multiple nested containers within the
process tree, the aforementioned algorithm might detect the PID of
one of those nested containers instead of the command executor's task.
Even though the `mnt` namespace will be the same across all these
candidates, the detected PID might belong to a short-lived container,
which might terminate before the containerizer launcher (aka the
`nanny` process) tries to enter its `mnt` namespace.

This patch fixes the detection algorithm so that it always returns
the PID of the command executor's task.

Review: https://reviews.apache.org/r/68257/
{noformat}
{noformat}
commit 31499a5dc1de29fa2178e6ea9e5398d8c668a933
Author: Andrei Budnik 
AuthorDate: Mon Aug 20 16:28:38 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Mon Aug 20 16:28:38 2018 +0200

Added `ROOT_CGROUPS_LaunchNestedDebugAfterUnshareMntNamespace` test.

This test verifies detection of the task's `mnt` namespace for a debug
nested container. A debug nested container must enter the `mnt`
namespace of the task, so the agent tries to detect the task's `mnt`
namespace. This test launches a long-running task which runs a subtask
that unshares its `mnt` namespace. The structure of the resulting
process tree is similar to the process tree of the command executor
(the task of the command executor unshares its `mnt` ns):

  0. root (aka the "nanny"/"launcher" process) [root `mnt` namespace]
    1. task: sleep 1000 [root `mnt` namespace]
      2. subtask: sleep 1000 [subtask's `mnt` namespace]

We expect that the agent detects the task's `mnt` namespace.

Review: https://reviews.apache.org/r/68408/
{noformat}
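A toy model of the detection problem described in the commits above: the buggy walk returned the first descendant whose `mnt` namespace differed from the launcher's, which could be a short-lived nested container, whereas the fix targets the command executor's task itself. The flattened-chain representation, field names, and function below are all invented for illustration and simplify the real algorithm considerably:

```python
# Toy model of the process tree in the test description above:
#   launcher (root mnt ns) -> task -> subtask (own mnt ns)
# The buggy walk picked the first descendant whose mnt ns differed
# from the launcher's (possibly a short-lived nested container); the
# fixed detection targets the command executor's task, the direct
# child of the launcher in this simplified model.
def detect_task_pid(chain):
    # chain[0] is the launcher; the task is its direct child.
    assert len(chain) >= 2
    return chain[1]["pid"]


chain = [
    {"pid": 100, "mnt_ns": "mnt:[1]"},   # launcher (root mnt ns)
    {"pid": 101, "mnt_ns": "mnt:[1]"},   # task: sleep 1000
    {"pid": 102, "mnt_ns": "mnt:[2]"},   # subtask: unshared mnt ns
]
assert detect_task_pid(chain) == 101     # the task, not the subtask
```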
{noformat}
commit b3c9c6939964831170e819f88134af7b275ffe1b
Author: Andrei Budnik 
AuthorDate: Mon Aug 20 16:28:44 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Mon Aug 20 16:28:44 2018 +0200

Fixed wrong `mnt` namespace detection for non-command executor tasks.

Previously, we were calling `getMountNamespaceTarget()` not only in
the case of the command executor but in all other cases too, including
the default executor. That might lead to various subtle bugs caused by
wrong detection of the `mnt` namespace target. This patch fixes the
issue by setting the parent PID as the `mnt` namespace target in the
case of a non-command executor task.

Review: https://reviews.apache.org/r/68348/
{noformat}
{noformat}
commit 52be35f47caea2712a0b13d7f963f7236533a2f1
Author: Andrei Budnik 
AuthorDate: Thu Sep 6 13:41:06 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Thu Sep 6 13:41:06 2018 +0200

Fixed `LaunchNestedContainerSessionsInParallel` test.

Previously, we sent `ATTACH_CONTAINER_OUTPUT` to attach to a
short-lived nested container. An attempt to attach to a terminated
nested container leads to an HTTP 500 error. This patch gets rid of
`ATTACH_CONTAINER_OUTPUT` in favor of `LAUNCH_NESTED_CONTAINER_SESSION`
so that we can read the container's output without using an extra call.

Review: https://reviews.apache.org/r/68236/
{noformat}


[jira] [Commented] (MESOS-8096) Enqueueing events in MockHTTPScheduler can lead to segfaults.

2018-09-05 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604170#comment-16604170
 ] 

Alexander Rukletsov commented on MESOS-8096:


Might be related to this issue, from {{clang-analyzer}}, courtesy [~mcypark]:
{noformat}
src/scheduler/scheduler.cpp:911:5: warning: Call to virtual function during 
destruction will not dispatch to derived class 
[clang-analyzer-optin.cplusplus.VirtualCall]
stop();
^
{noformat}
Likely a hypothetical control flow starting from 
{{src/tests/http_fault_tolerance_tests.cpp:872}}
{noformat}
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1272:5:
 warning: Use of memory after it is freed [clang-analyzer-cplusplus.NewDelete]
return function_mocker_->AddNewExpectation(
^
/tmp/SRC/src/tests/http_fault_tolerance_tests.cpp:872:3: note: Calling 
'MockSpec::InternalExpectedAt'
  EXPECT_CALL(*scheduler, connected(_))
  ^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1845:32:
 note: expanded from macro 'EXPECT_CALL'
#define EXPECT_CALL(obj, call) GMOCK_EXPECT_CALL_IMPL_(obj, call)
   ^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1844:5:
 note: expanded from macro 'GMOCK_EXPECT_CALL_IMPL_'
((obj).gmock_##call).InternalExpectedAt(__FILE__, __LINE__, #obj, #call)
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1272:12:
 note: Calling 'FunctionMockerBase::AddNewExpectation'
return function_mocker_->AddNewExpectation(
   ^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1609:9:
 note: Memory is allocated
new TypedExpectation(this, file, line, source_text, m);
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1615:9:
 note: Assuming 'implicit_sequence' is equal to NULL
if (implicit_sequence != NULL) {
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1615:5:
 note: Taking false branch
if (implicit_sequence != NULL) {
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1619:13:
 note: Calling '~linked_ptr'
return *expectation;
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googletest/include/gtest/internal/gtest-linked_ptr.h:153:19:
 note: Calling 'linked_ptr::depart'
  ~linked_ptr() { depart(); }
  ^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googletest/include/gtest/internal/gtest-linked_ptr.h:205:5:
 note: Taking true branch
if (link_.depart()) delete value_;
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googletest/include/gtest/internal/gtest-linked_ptr.h:205:25:
 note: Memory is released
if (link_.depart()) delete value_;
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googletest/include/gtest/internal/gtest-linked_ptr.h:153:19:
 note: Returning; memory was released
  ~linked_ptr() { depart(); }
  ^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1619:13:
 note: Returning from '~linked_ptr'
return *expectation;
^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1272:12:
 note: Returning; memory was released
return function_mocker_->AddNewExpectation(
   ^
/BUILD/3rdparty/googletest-1.8.0/src/googletest-1.8.0/googlemock/include/gmock/gmock-spec-builders.h:1272:5:
 note: Use of memory after it is freed
return function_mocker_->AddNewExpectation(
^
{noformat}
There is what seems to be equivalent output for the following places:
{noformat}
/tmp/SRC/src/tests/uri_fetcher_tests.cpp:140:3: note: Calling 
'MockSpec::InternalExpectedAt'
  EXPECT_CALL(server, test(_))
  ^
{noformat}
{noformat}
/tmp/SRC/src/tests/default_executor_tests.cpp:2042:3: note: Calling 
'MockSpec::InternalExpectedAt'
  EXPECT_CALL(*scheduler, connected(_))
  ^
{noformat}
{noformat}
/tmp/SRC/src/tests/scheduler_tests.cpp:2037:3: note: Calling 
'MockSpec::InternalExpectedAt'
  EXPECT_CALL(*scheduler, connected(_))
  ^
{noformat}
{noformat}
/tmp/SRC/src/tests/fetcher_tests.cpp:535:3: note: Calling 
'MockSpec::InternalExpectedAt'
  EXPECT_CALL(*http.process, test(_))
  ^
{noformat}
Of all the {{EXPECT_CALL}}s in the codebase, these are the only instances that 
are flagged. It is still unclear whether there's an issue here, but it seems 
worth checking out, especially since these files are known to be flaky.

> Enqueueing events in MockHTTPScheduler can lead to segfaults.
> -
>
>

[jira] [Comment Edited] (MESOS-9116) Launch nested container session fails due to incorrect detection of `mnt` namespace of command executor's task.

2018-09-03 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16586276#comment-16586276
 ] 

Alexander Rukletsov edited comment on MESOS-9116 at 9/3/18 10:09 AM:
-

Backports to 1.6.x:
{noformat}
cfba574408a85861d424a2c58d3d7277490c398e
6d884fbf9be169fd97483a1f341540c5354d88a9
a4409826deada53eef8843df1a0178e9edfa4c9c
20a4d4fae2f30f9e5436a154087c1a1bb9dc0629
{noformat}
Backports to 1.5.x:
{noformat}
6dd3fcc8ab2aecd182fff29deac07b32b3cc2d81
edeac7b0da5dd7ee1e4e50320d964eb84220d87d
966574a31a3f8c5d4f9a5f02eeb1644aff7fdc97
e4d8ab9911af6d494aae7f5762dd84b8f085fd1e
{noformat}
Backports to 1.4.x (partial):
{noformat}
c37eb59e4c4b7b6c16509f317c78207da6eeb485
{noformat}


was (Author: alexr):
Backports to 1.6.x:
{noformat}
cfba574408a85861d424a2c58d3d7277490c398e
6d884fbf9be169fd97483a1f341540c5354d88a9
a4409826deada53eef8843df1a0178e9edfa4c9c
20a4d4fae2f30f9e5436a154087c1a1bb9dc0629
{noformat}
Backports to 1.5.x:
{noformat}
6dd3fcc8ab2aecd182fff29deac07b32b3cc2d81
edeac7b0da5dd7ee1e4e50320d964eb84220d87d
966574a31a3f8c5d4f9a5f02eeb1644aff7fdc97
e4d8ab9911af6d494aae7f5762dd84b8f085fd1e
{noformat}
Backports to 1.4.x:
{noformat}
c37eb59e4c4b7b6c16509f317c78207da6eeb485
{noformat}

> Launch nested container session fails due to incorrect detection of `mnt` 
> namespace of command executor's task.
> ---
>
> Key: MESOS-9116
> URL: https://issues.apache.org/jira/browse/MESOS-9116
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Critical
>  Labels: mesosphere
> Fix For: 1.4.3, 1.5.2, 1.6.2, 1.7.0
>
> Attachments: pstree.png
>
>
> Launch nested container call might fail with the following error:
> {code:java}
> Failed to enter mount namespace: Failed to open '/proc/29473/ns/mnt': No such 
> file or directory
> {code}
> This happens when the containerizer launcher [tries to 
> enter|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/launch.cpp#L879-L892]
>  `mnt` namespace using the pid of a terminated process. The pid [was 
> detected|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/containerizer.cpp#L1930-L1958]
>  by the agent before spawning the containerizer launcher process, because the 
> process was running back then.
> The issue can be reproduced using the following test (pseudocode):
> {code:java}
> launchTask("sleep 1000")
> parentContainerId = containerizer.containers().begin()
> outputs = []
> for i in range(10):
>   ContainerId containerId
>   containerId.parent = parentContainerId
>   containerId.id = UUID.random()
>   LAUNCH_NESTED_CONTAINER_SESSION(containerId, "echo echo")
>   response = ATTACH_CONTAINER_OUTPUT(containerId)
>   outputs.append(response.reader)
> for output in outputs:
>   stdout, stderr = getProcessIOData(output)
>   assert("echo" == stdout + stderr){code}
> When we start the very first nested container, `getMountNamespaceTarget()` 
> returns a PID of the task (`sleep 1000`), because it's the only process whose 
> `mnt` namespace differs from the parent container. This nested container 
> becomes a child of PID 1 process, which is also a parent of the command 
> executor. It's not an executor's child! It can be seen in attached 
> `pstree.png`.
> When we start a second nested container, `getMountNamespaceTarget()` might 
> return PID of the previous nested container (`echo echo`) instead of the 
> task's PID (`sleep 1000`). It happens because the first nested container 
> entered `mnt` namespace of the task. Then, the containerizer launcher 
> ("nanny" process) attempts to enter `mnt` namespace using the PID of a 
> terminated process, so we get this error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9116) Launch nested container session fails due to incorrect detection of `mnt` namespace of command executor's task.

2018-08-31 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16586276#comment-16586276
 ] 

Alexander Rukletsov edited comment on MESOS-9116 at 8/31/18 3:15 PM:
-

Backports to 1.6.x:
{noformat}
cfba574408a85861d424a2c58d3d7277490c398e
6d884fbf9be169fd97483a1f341540c5354d88a9
a4409826deada53eef8843df1a0178e9edfa4c9c
20a4d4fae2f30f9e5436a154087c1a1bb9dc0629
{noformat}
Backports to 1.5.x:
{noformat}
6dd3fcc8ab2aecd182fff29deac07b32b3cc2d81
edeac7b0da5dd7ee1e4e50320d964eb84220d87d
966574a31a3f8c5d4f9a5f02eeb1644aff7fdc97
e4d8ab9911af6d494aae7f5762dd84b8f085fd1e
{noformat}
Backports to 1.4.x:
{noformat}
c37eb59e4c4b7b6c16509f317c78207da6eeb485
{noformat}


was (Author: alexr):
Backports to 1.6.x:
{noformat}
cfba574408a85861d424a2c58d3d7277490c398e
6d884fbf9be169fd97483a1f341540c5354d88a9
a4409826deada53eef8843df1a0178e9edfa4c9c
20a4d4fae2f30f9e5436a154087c1a1bb9dc0629
{noformat}
Backports to 1.5.x:
{noformat}
6dd3fcc8ab2aecd182fff29deac07b32b3cc2d81
edeac7b0da5dd7ee1e4e50320d964eb84220d87d
966574a31a3f8c5d4f9a5f02eeb1644aff7fdc97
e4d8ab9911af6d494aae7f5762dd84b8f085fd1e
{noformat}
Backports to 1.4.x:
{noformat}
c37eb59e4c4b7b6c16509f317c78207da6eeb485
05ec5d1770aeda25b4995487e40f690fe8fa6b19
{noformat}

> Launch nested container session fails due to incorrect detection of `mnt` 
> namespace of command executor's task.
> ---
>
> Key: MESOS-9116
> URL: https://issues.apache.org/jira/browse/MESOS-9116
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Critical
>  Labels: mesosphere
> Fix For: 1.4.3, 1.5.2, 1.6.2, 1.7.0
>
> Attachments: pstree.png
>
>
> Launch nested container call might fail with the following error:
> {code:java}
> Failed to enter mount namespace: Failed to open '/proc/29473/ns/mnt': No such 
> file or directory
> {code}
> This happens when the containerizer launcher [tries to 
> enter|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/launch.cpp#L879-L892]
>  `mnt` namespace using the pid of a terminated process. The pid [was 
> detected|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/containerizer.cpp#L1930-L1958]
>  by the agent before spawning the containerizer launcher process, because the 
> process was running back then.
> The issue can be reproduced using the following test (pseudocode):
> {code:java}
> launchTask("sleep 1000")
> parentContainerId = containerizer.containers().begin()
> outputs = []
> for i in range(10):
>   ContainerId containerId
>   containerId.parent = parentContainerId
>   containerId.id = UUID.random()
>   LAUNCH_NESTED_CONTAINER_SESSION(containerId, "echo echo")
>   response = ATTACH_CONTAINER_OUTPUT(containerId)
>   outputs.append(response.reader)
> for output in outputs:
>   stdout, stderr = getProcessIOData(output)
>   assert("echo" == stdout + stderr){code}
> When we start the very first nested container, `getMountNamespaceTarget()` 
> returns a PID of the task (`sleep 1000`), because it's the only process whose 
> `mnt` namespace differs from the parent container. This nested container 
> becomes a child of PID 1 process, which is also a parent of the command 
> executor. It's not an executor's child! It can be seen in attached 
> `pstree.png`.
> When we start a second nested container, `getMountNamespaceTarget()` might 
> return PID of the previous nested container (`echo echo`) instead of the 
> task's PID (`sleep 1000`). It happens because the first nested container 
> entered `mnt` namespace of the task. Then, the containerizer launcher 
> ("nanny" process) attempts to enter `mnt` namespace using the PID of a 
> terminated process, so we get this error.





[jira] [Comment Edited] (MESOS-9116) Launch nested container session fails due to incorrect detection of `mnt` namespace of command executor's task.

2018-08-31 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16586276#comment-16586276
 ] 

Alexander Rukletsov edited comment on MESOS-9116 at 8/31/18 11:56 AM:
--

Backports to 1.6.x:
{noformat}
cfba574408a85861d424a2c58d3d7277490c398e
6d884fbf9be169fd97483a1f341540c5354d88a9
a4409826deada53eef8843df1a0178e9edfa4c9c
20a4d4fae2f30f9e5436a154087c1a1bb9dc0629
{noformat}
Backports to 1.5.x:
{noformat}
6dd3fcc8ab2aecd182fff29deac07b32b3cc2d81
edeac7b0da5dd7ee1e4e50320d964eb84220d87d
966574a31a3f8c5d4f9a5f02eeb1644aff7fdc97
e4d8ab9911af6d494aae7f5762dd84b8f085fd1e
{noformat}
Backports to 1.4.x:
{noformat}
c37eb59e4c4b7b6c16509f317c78207da6eeb485
05ec5d1770aeda25b4995487e40f690fe8fa6b19
{noformat}


was (Author: alexr):
Backports to 1.6.x:
{noformat}
cfba574408a85861d424a2c58d3d7277490c398e
6d884fbf9be169fd97483a1f341540c5354d88a9
a4409826deada53eef8843df1a0178e9edfa4c9c
20a4d4fae2f30f9e5436a154087c1a1bb9dc0629
{noformat}
Backports to 1.5.x:
{noformat}
6dd3fcc8ab2aecd182fff29deac07b32b3cc2d81
edeac7b0da5dd7ee1e4e50320d964eb84220d87d
966574a31a3f8c5d4f9a5f02eeb1644aff7fdc97
e4d8ab9911af6d494aae7f5762dd84b8f085fd1e
{noformat}

> Launch nested container session fails due to incorrect detection of `mnt` 
> namespace of command executor's task.
> ---
>
> Key: MESOS-9116
> URL: https://issues.apache.org/jira/browse/MESOS-9116
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Critical
>  Labels: mesosphere
> Fix For: 1.5.2, 1.6.2, 1.7.0
>
> Attachments: pstree.png
>
>
> Launch nested container call might fail with the following error:
> {code:java}
> Failed to enter mount namespace: Failed to open '/proc/29473/ns/mnt': No such 
> file or directory
> {code}
> This happens when the containerizer launcher [tries to 
> enter|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/launch.cpp#L879-L892]
>  `mnt` namespace using the pid of a terminated process. The pid [was 
> detected|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/containerizer.cpp#L1930-L1958]
>  by the agent before spawning the containerizer launcher process, because the 
> process was running back then.
> The issue can be reproduced using the following test (pseudocode):
> {code:java}
> launchTask("sleep 1000")
> parentContainerId = containerizer.containers().begin()
> outputs = []
> for i in range(10):
>   ContainerId containerId
>   containerId.parent = parentContainerId
>   containerId.id = UUID.random()
>   LAUNCH_NESTED_CONTAINER_SESSION(containerId, "echo echo")
>   response = ATTACH_CONTAINER_OUTPUT(containerId)
>   outputs.append(response.reader)
> for output in outputs:
>   stdout, stderr = getProcessIOData(output)
>   assert("echo" == stdout + stderr){code}
> When we start the very first nested container, `getMountNamespaceTarget()` 
> returns a PID of the task (`sleep 1000`), because it's the only process whose 
> `mnt` namespace differs from the parent container. This nested container 
> becomes a child of PID 1 process, which is also a parent of the command 
> executor. It's not an executor's child! It can be seen in attached 
> `pstree.png`.
> When we start a second nested container, `getMountNamespaceTarget()` might 
> return PID of the previous nested container (`echo echo`) instead of the 
> task's PID (`sleep 1000`). It happens because the first nested container 
> entered `mnt` namespace of the task. Then, the containerizer launcher 
> ("nanny" process) attempts to enter `mnt` namespace using the PID of a 
> terminated process, so we get this error.





[jira] [Commented] (MESOS-7076) libprocess tests fail when using libevent 2.1.8

2018-08-31 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598433#comment-16598433
 ] 

Alexander Rukletsov commented on MESOS-7076:


Original libevent-ML thread: 
http://archives.seul.org/libevent/users/Feb-2018/msg3.html
Follow-up from Till: 
http://archives.seul.org/libevent/users/Aug-2018/msg9.html

> libprocess tests fail when using libevent 2.1.8
> ---
>
> Key: MESOS-7076
> URL: https://issues.apache.org/jira/browse/MESOS-7076
> Project: Mesos
>  Issue Type: Bug
>  Components: build, libprocess, test
> Environment: macOS 10.12.3, libevent 2.1.8 (installed via Homebrew)
>Reporter: Jan Schlicht
>Assignee: Till Toenshoff
>Priority: Critical
>  Labels: ci
> Attachments: libevent-openssl11.patch
>
>
> Running {{libprocess-tests}} on Mesos compiled with {{--enable-libevent 
> --enable-ssl}} on an operating system using libevent 2.1.8, SSL related tests 
> fail like
> {noformat}
> [ RUN  ] SSLTest.SSLSocket
> I0207 15:20:46.017881 2528580544 openssl.cpp:419] CA file path is 
> unspecified! NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0207 15:20:46.017904 2528580544 openssl.cpp:424] CA directory path 
> unspecified! NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0207 15:20:46.017918 2528580544 openssl.cpp:429] Will not verify peer 
> certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0207 15:20:46.017923 2528580544 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0207 15:20:46.033001 2528580544 openssl.cpp:419] CA file path is 
> unspecified! NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0207 15:20:46.033179 2528580544 openssl.cpp:424] CA directory path 
> unspecified! NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0207 15:20:46.033196 2528580544 openssl.cpp:429] Will not verify peer 
> certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0207 15:20:46.033201 2528580544 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> ../../../3rdparty/libprocess/src/tests/ssl_tests.cpp:257: Failure
> Failed to wait 15secs for Socket(socket.get()).recv()
> [  FAILED  ] SSLTest.SSLSocket (15196 ms)
> {noformat}
> Tests failing are
> {noformat}
> SSLTest.SSLSocket
> SSLTest.NoVerifyBadCA
> SSLTest.VerifyCertificate
> SSLTest.ProtocolMismatch
> SSLTest.ECDHESupport
> SSLTest.PeerAddress
> SSLTest.HTTPSGet
> SSLTest.HTTPSPost
> SSLTest.SilentSocket
> SSLTest.ShutdownThenSend
> SSLVerifyIPAdd/SSLTest.BasicSameProcess/0, where GetParam() = "false"
> SSLVerifyIPAdd/SSLTest.BasicSameProcess/1, where GetParam() = "true"
> SSLVerifyIPAdd/SSLTest.BasicSameProcessUnix/0, where GetParam() = "false"
> SSLVerifyIPAdd/SSLTest.BasicSameProcessUnix/1, where GetParam() = "true"
> SSLVerifyIPAdd/SSLTest.RequireCertificate/0, where GetParam() = "false"
> SSLVerifyIPAdd/SSLTest.RequireCertificate/1, where GetParam() = "true"
> {noformat}





[jira] [Comment Edited] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.

2018-08-29 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596258#comment-16596258
 ] 

Alexander Rukletsov edited comment on MESOS-8545 at 8/29/18 3:01 PM:
-

When the agent handles {{ATTACH_CONTAINER_INPUT}} call, it creates an HTTP 
[streaming 
connection|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3104]
 to IOSwitchboard.
 After the agent 
[sends|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3141]
 a request to IOSwitchboard, a new instance of {{ConnectionProcess}} is 
created, which calls 
[{{ConnectionProcess::read()}}|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1220]
 to read an HTTP response from IOSwitchboard.
 If the socket is closed before a `\r\n\r\n` response is received, the 
{{ConnectionProcess}} calls 
`[disconnect()|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1326]`,
 which in turn [flushes 
`pipeline`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1197-L1201]
 containing a {{Response}} promise. This leads to responding back (to the 
{{AttachInputToNestedContainerSession}} 
[test|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/tests/api_tests.cpp#L7942-L7943])
 an {{HTTP 500}} error with body "Disconnected".

When io redirect finishes, IOSwitchboardServerProcess calls {{terminate(self(), 
false)}} (here 
[\[1\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1262]
 or there 
[\[2\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1713]).
 Then, {{IOSwitchboardServerProcess::finalize()}} sets a value to the 
[`promise`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1304-L1308],
 which [unblocks 
{{main()}}|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard_main.cpp#L149-L150]
 function. As a result, IOSwitchboard process terminates immediately.

When IOSwitchboard terminates, there could be not yet 
[written|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1699]
 response messages to the socket. So, if any delay occurs before 
[sending|https://github.com/apache/mesos/blob/95bbe784da51b3a7eaeb9127e2541ea0b2af07b5/3rdparty/libprocess/src/http.cpp#L1742-L1748]
 the response back to the agent, the socket will be closed due to IOSwitchboard 
process termination. That leads to the aforementioned premature socket close in 
the agent.

See my previous comment which includes steps to reproduce the bug.


was (Author: abudnik):
When the agent handles `ATTACH_CONTAINER_INPUT` call, it creates an HTTP 
[streaming 
connection|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3104]
 to IOSwitchboard.
 After the agent 
[sends|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3141]
 a request to IOSwitchboard, a new instance of `ConnectionProcess` is created, 
which calls 
[`ConnectionProcess::read()`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1220]
 to read an HTTP response from IOSwitchboard.
 If the socket is closed before a `\r\n\r\n` response is received, the 
`ConnectionProcess` calls 
`[disconnect()|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1326]`,
 which in turn [flushes 
`pipeline`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1197-L1201]
 containing a `Response` promise. This leads to responding back (to the 
`AttachInputToNestedContainerSession` 
[test|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/tests/api_tests.cpp#L7942-L7943])
 an `HTTP 500` error with body "Disconnected".

When io redirect finishes, IOSwitchboardServerProcess calls `terminate(self(), 
false)` (here 
[\[1\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1262]
 or there 
[\[2\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1713]).
 Then, `IOSwitchboardServerProcess::finalize()` sets a value to the 

[jira] [Assigned] (MESOS-4233) Logging is too verbose for sysadmins / syslog

2018-08-29 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-4233:
--

Assignee: (was: Kapil Arya)

> Logging is too verbose for sysadmins / syslog
> -
>
> Key: MESOS-4233
> URL: https://issues.apache.org/jira/browse/MESOS-4233
> Project: Mesos
>  Issue Type: Epic
>Reporter: Cody Maloney
>Priority: Major
>  Labels: mesosphere
> Attachments: giant_port_range_logging
>
>
> Currently mesos logs a lot. When launching a thousand tasks in the space of 
> 10 seconds it will print tens of thousands of log lines, overwhelming syslog 
> (there is a max rate at which a process can send stuff over a unix socket) 
> and not giving useful information to a sysadmin who cares about just the 
> high-level activity and when something goes wrong.
> Note mesos also blocks writing to its log locations, so when writing a lot of 
> log messages, it can fill up the write buffer in the kernel, and be suspended 
> until the syslog agent catches up reading from the socket (GLOG does a 
> blocking fwrite to stderr). GLOG also has a big mutex around logging so only 
> one thing logs at a time.
> While for "internal debugging" it is useful to see things like "message went 
> from internal component x to internal component y", from a sysadmin 
> perspective I only care about the high level actions taken (launched task for 
> framework x), sent offer to framework y, got task failed from host z. Note 
> those are what I'd expect at the "INFO" level. At the "WARNING" level I'd 
> expect very little to be logged / almost nothing in normal operation. Just 
> things like "WARN: Replicated log write took longer than expected". WARN 
> would also get things like backtraces on crashes and abnormal exits / abort.
> When trying to launch 3k+ tasks inside a second, mesos logging currently 
> overwhelms syslog with 100k+ messages, many of which are thousands of bytes. 
> Sysadmins expect to be able to use syslog to monitor basic events in their 
> system. This is too much.
> We can keep logging the messages to files, but the logging to stderr needs to 
> be reduced significantly (stderr gets picked up and forwarded to syslog / 
> central aggregation).
> What I would like is if I can set the stderr logging level to be different / 
> independent from the file logging level (Syslog giving the "sysadmin" 
> aggregated overview, files useful for debugging in depth what happened in a 
> cluster). A lot of what mesos currently logs at info is really debugging info 
> / should show up as debug log level.
> Some samples of mesos logging a lot more than a sysadmin would want / expect 
> are attached, and some are below:
>  - Every task gets printed multiple times for a basic launch:
> {noformat}
> Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: 
> I1215 22:58:29.382644  1315 master.cpp:3248] Launching task 
> envy.5b19a713-a37f-11e5-8b3e-0251692d6109 of framework 
> 5178f46d-71d6-422f-922c-5bbe82dff9cc- (marathon)
> Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: 
> I1215 22:58:29.382925  1315 master.hpp:176] Adding task 
> envy.5b1958f2-a37f-11e5-8b3e-0251692d6109 with resources cpus(​*):0.0001; 
> mem(*​):16; ports(*):[14047-14047]
> {noformat}
>  - Every task status update prints many log lines, successful ones are part 
> of normal operation and maybe should be logged at info / debug levels, but 
> not to a sysadmin (Just show when things fail, and maybe aggregate counters 
> to tell of the volume of working)
>  - No log messages should be really big / more than 1k characters (Would 
> prevent the giant port list attached, make that easily discoverable / bug 
> filable / fixable) 
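A possible mitigation sketch: glog (which Mesos uses) already allows the stderr threshold to be raised independently of the file log level. The flag names below are standard glog environment knobs; the log directory is an example path, not a Mesos default.

```shell
export GLOG_log_dir=/var/log/mesos   # full INFO+ detail goes to files here
export GLOG_minloglevel=0            # keep INFO in the files for debugging
export GLOG_stderrthreshold=1        # only WARNING+ reaches stderr -> syslog
```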





[jira] [Commented] (MESOS-9189) Include 'Connection: close' header in streaming API responses.

2018-08-29 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596183#comment-16596183
 ] 

Alexander Rukletsov commented on MESOS-9189:


I'm not sure I understand how the change is supposed to help. {{'Connection: 
close'}} set by a server is an indicator for the client to close the connection 
_after_ receiving the complete response. AFAIK, we don't ever complete the 
streaming response in Mesos and there is no way for Mesos to somehow 
understand that an end client might not be interested in the stream any more 
and send an empty chunk. From a middleman's point of view the actual value of 
the {{'Connection'}} header is only interesting _after_ the response is 
completed, i.e., an empty chunk has been received, which, IIRC, never happens 
in our case.

Is the hope here that some middlemen peek into the {{'Connection'}} header 
and, based on it, decide to close the connection themselves when their 
client disconnects, even though the response might not be completed?

> Include 'Connection: close' header in streaming API responses.
> --
>
> Key: MESOS-9189
> URL: https://issues.apache.org/jira/browse/MESOS-9189
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>
> We've seen some HTTP intermediaries (e.g. ELB) decide to re-use connections 
> to mesos as an optimization to avoid re-connection overhead. As a result, 
> when the end-client of the streaming API disconnects from the intermediary, 
> the intermediary leaves the connection to mesos open in an attempt to re-use 
> the connection for another request once the response completes. Mesos then 
> thinks that the subscriber never disconnected and the intermediary happily 
> continues to read the streaming events even though there's no end-client.
> To help indicate to intermediaries that the connection SHOULD NOT be re-used, 
> we can set the 'Connection: close' header for streaming API responses. It may 
> not be respected (since the language seems to be SHOULD NOT), but some 
> intermediaries may respect it and close the connection if the end-client 
> disconnects.
> Note that libprocess' http server currently doesn't close the connection 
> based on a handler setting this header, but it doesn't matter here since the 
> streaming API responses are infinite.





[jira] [Comment Edited] (MESOS-9158) Batch state-related read-only requests in the Master actor.

2018-08-28 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595547#comment-16595547
 ] 

Alexander Rukletsov edited comment on MESOS-9158 at 8/28/18 8:17 PM:
-

{noformat}
commit 4118a482a95793252f4713c5e20ef2c70f2ab07b
Author: Benno Evers 
AuthorDate: Tue Aug 28 21:25:52 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 28 21:25:52 2018 +0200

Added '/state-summary' to the set of batched master endpoints.

Review: https://reviews.apache.org/r/68321/
{noformat}
{noformat}
commit 63e9096b0cd883d9edc8907a577bcba0b150b541
Author: Benno Evers 
AuthorDate: Tue Aug 28 21:26:03 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 28 21:26:03 2018 +0200

Added '/tasks' to the set of batched master endpoints.

Review: https://reviews.apache.org/r/68440/
{noformat}
{noformat}
commit 33c38c9baa20b42562b519971df508283d988abc
Author: Benno Evers 
AuthorDate: Tue Aug 28 21:26:11 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 28 21:26:11 2018 +0200

Added '/slaves' to the set of batched master endpoints.

Review: https://reviews.apache.org/r/68441/
{noformat}
{noformat}
commit 102dcca4e0116d2ffbdcd78d998e032841ffbabe
Author: Benno Evers 
AuthorDate: Tue Aug 28 21:26:18 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 28 21:26:18 2018 +0200

Added '/frameworks' to the set of batched master endpoints.

Review: https://reviews.apache.org/r/68442/
{noformat}
{noformat}
commit 44e523490b394e6c43bce8b5304996137d176f96
Author: Benno Evers 
AuthorDate: Tue Aug 28 21:26:25 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 28 21:26:25 2018 +0200

Moved members of `ReadOnlyHandler` into separate file.

Moved the member functions of class `ReadOnlyHandler` into
the new file `readonly_handler.cpp`. This follows the pattern
established by `weights_handler.cpp` and `quota_handler.cpp`.

As part of this move, it was also necessary to move some JSON
serialization that are used from both `master.cpp` and
`readonly_handler.cpp` to a new pair of files `json.{cpp,hpp}`
that can be used from both places.

Review: https://reviews.apache.org/r/68473/
{noformat}
{noformat}
commit 4930ec2e141920411fb9050500f385f5ef6a78a2
Author: Benno Evers 
AuthorDate: Tue Aug 28 21:26:36 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 28 21:49:41 2018 +0200

Cleaned up some style issues in `ReadOnlyHandler`.

This commit fixes several minor style issues:
 - Sorted member function declarations of `ReadOnlyHandler`
   alphabetically.
 - Added notes to remind readers of the fact that requests
   to certain endpoints are batched.
 - Changed captured variable in `/frameworks` endpoint handler.

Review: https://reviews.apache.org/r/68537/
{noformat}


was (Author: alexr):
{noformat}
commit 4118a482a95793252f4713c5e20ef2c70f2ab07b
Author: Benno Evers 
AuthorDate: Tue Aug 28 21:25:52 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 28 21:25:52 2018 +0200

Added '/state-summary' to the set of batched master endpoints.

Review: https://reviews.apache.org/r/68321/
{noformat}
{noformat}
commit 63e9096b0cd883d9edc8907a577bcba0b150b541
Author: Benno Evers 
AuthorDate: Tue Aug 28 21:26:03 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 28 21:26:03 2018 +0200

Added '/tasks' to the set of batched master endpoints.

Review: https://reviews.apache.org/r/68440/
{noformat}
{noformat}
commit 33c38c9baa20b42562b519971df508283d988abc
Author: Benno Evers 
AuthorDate: Tue Aug 28 21:26:11 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 28 21:26:11 2018 +0200

Added '/slaves' to the set of batched master endpoints.

Review: https://reviews.apache.org/r/68441/
{noformat}
{noformat}
commit 102dcca4e0116d2ffbdcd78d998e032841ffbabe
Author: Benno Evers 
AuthorDate: Tue Aug 28 21:26:18 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 28 21:26:18 2018 +0200

Added '/frameworks' to the set of batched master endpoints.

Review: https://reviews.apache.org/r/68442/
{noformat}
{noformat}
commit 44e523490b394e6c43bce8b5304996137d176f96
Author: Benno Evers 
AuthorDate: Tue Aug 28 21:26:25 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 28 21:26:25 2018 +0200

Moved members of `ReadOnlyHandler` into separate file.

Moved the member functions of class `ReadOnlyHandler` into
the new file `readonly_handler.cpp`. This follows the pattern
established by `weights_handler.cpp` and `quota_handler.cpp`.

As part of this move, it was also necessary to move some JSON
serialization that are used from both `master.cpp` and
`readonly_handler.cpp` to a new pair of files `json.{cpp,hpp}`
that can be used from both places.

Review: https://reviews.apache.org/r/68473/
{noformat}

[jira] [Commented] (MESOS-9158) Batch state-related read-only requests in the Master actor.

2018-08-28 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595547#comment-16595547
 ] 

Alexander Rukletsov commented on MESOS-9158:


{noformat}
commit 4118a482a95793252f4713c5e20ef2c70f2ab07b
Author: Benno Evers 
AuthorDate: Tue Aug 28 21:25:52 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 28 21:25:52 2018 +0200

Added '/state-summary' to the set of batched master endpoints.

Review: https://reviews.apache.org/r/68321/
{noformat}
{noformat}
commit 63e9096b0cd883d9edc8907a577bcba0b150b541
Author: Benno Evers 
AuthorDate: Tue Aug 28 21:26:03 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 28 21:26:03 2018 +0200

Added '/tasks' to the set of batched master endpoints.

Review: https://reviews.apache.org/r/68440/
{noformat}
{noformat}
commit 33c38c9baa20b42562b519971df508283d988abc
Author: Benno Evers 
AuthorDate: Tue Aug 28 21:26:11 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 28 21:26:11 2018 +0200

Added '/slaves' to the set of batched master endpoints.

Review: https://reviews.apache.org/r/68441/
{noformat}
{noformat}
commit 102dcca4e0116d2ffbdcd78d998e032841ffbabe
Author: Benno Evers 
AuthorDate: Tue Aug 28 21:26:18 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 28 21:26:18 2018 +0200

Added '/frameworks' to the set of batched master endpoints.

Review: https://reviews.apache.org/r/68442/
{noformat}
{noformat}
commit 44e523490b394e6c43bce8b5304996137d176f96
Author: Benno Evers 
AuthorDate: Tue Aug 28 21:26:25 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 28 21:26:25 2018 +0200

Moved members of `ReadOnlyHandler` into separate file.

Moved the member functions of class `ReadOnlyHandler` into
the new file `readonly_handler.cpp`. This follows the pattern
established by `weights_handler.cpp` and `quota_handler.cpp`.

As part of this move, it was also necessary to move some JSON
serialization that are used from both `master.cpp` and
`readonly_handler.cpp` to a new pair of files `json.{cpp,hpp}`
that can be used from both places.

Review: https://reviews.apache.org/r/68473/
{noformat}
{noformat}
commit 4930ec2e141920411fb9050500f385f5ef6a78a2
Author: Benno Evers 
AuthorDate: Tue Aug 28 21:26:36 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 28 21:49:41 2018 +0200

Cleaned up some style issues in `ReadOnlyHandler`.

This commit fixes several minor style issues:
 - Sorted member function declarations of `ReadOnlyHandler`
   alphabetically.
 - Added notes to remind readers of the fact that requests
   to certain endpoints are batched.
 - Changed captured variable in `/frameworks` endpoint handler.

Review: https://reviews.apache.org/r/68537/
{noformat}

> Batch state-related read-only requests in the Master actor.
> ---
>
> Key: MESOS-9158
> URL: https://issues.apache.org/jira/browse/MESOS-9158
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Alexander Rukletsov
>Assignee: Benno Evers
>Priority: Major
>  Labels: mesosphere, performance
>
> Similar to MESOS-9122, make all read-only master state endpoints batched.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9185) An attempt to remove or destroy container in composing containerizer leads to segfault.

2018-08-28 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595084#comment-16595084
 ] 

Alexander Rukletsov edited comment on MESOS-9185 at 8/28/18 4:11 PM:
-

*1.8.0-dev:*
{noformat}
commit 8496b369d52d27e90da88787242fd6f9d9abb78e
Author: Andrei Budnik 
AuthorDate: Tue Aug 28 16:46:54 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 28 16:46:54 2018 +0200

Added `AgentAPITest.LaunchNestedContainerWithUnknownParent` test.

This test verifies that launch nested container fails when the parent
container is unknown to the containerizer.

Review: https://reviews.apache.org/r/68234/
{noformat}
{noformat}
commit 5fbfb8da5ad62c40752fa7b7e0a0842c892f6857
Author: Andrei Budnik 
AuthorDate: Tue Aug 28 16:47:04 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 28 16:47:04 2018 +0200

Cleaned up container on launch failures in composing containerizer.

Previously, if a parent container was unknown to the composing
containerizer during an attempt to launch a nested container
via `ComposingContainerizerProcess::launch()`, the composing
containerizer returned an error without cleaning up the container.
The `containerizer` field was uninitialized, so a further attempt
to remove or destroy the nested container led to segfault.

This patch removes the container when the parent container is unknown.

Review: https://reviews.apache.org/r/68235/
{noformat}
*backport to 1.7.1:*
{noformat}
commit 1660a0552e58ba4407180508f7e4eeed2050b2a2
Author: Andrei Budnik 
AuthorDate: Tue Aug 28 16:47:04 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 28 18:07:44 2018 +0200

Cleaned up container on launch failures in composing containerizer.

Previously, if a parent container was unknown to the composing
containerizer during an attempt to launch a nested container
via `ComposingContainerizerProcess::launch()`, the composing
containerizer returned an error without cleaning up the container.
The `containerizer` field was uninitialized, so a further attempt
to remove or destroy the nested container led to segfault.

This patch removes the container when the parent container is unknown.

Review: https://reviews.apache.org/r/68235/
(cherry picked from commit 5fbfb8da5ad62c40752fa7b7e0a0842c892f6857)
{noformat}


was (Author: alexr):
{noformat}
commit 8496b369d52d27e90da88787242fd6f9d9abb78e
Author: Andrei Budnik 
AuthorDate: Tue Aug 28 16:46:54 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 28 16:46:54 2018 +0200

Added `AgentAPITest.LaunchNestedContainerWithUnknownParent` test.

This test verifies that launch nested container fails when the parent
container is unknown to the containerizer.

Review: https://reviews.apache.org/r/68234/
{noformat}
{noformat}
commit 5fbfb8da5ad62c40752fa7b7e0a0842c892f6857
Author: Andrei Budnik 
AuthorDate: Tue Aug 28 16:47:04 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 28 16:47:04 2018 +0200

Cleaned up container on launch failures in composing containerizer.

Previously, if a parent container was unknown to the composing
containerizer during an attempt to launch a nested container
via `ComposingContainerizerProcess::launch()`, the composing
containerizer returned an error without cleaning up the container.
The `containerizer` field was uninitialized, so a further attempt
to remove or destroy the nested container led to segfault.

This patch removes the container when the parent container is unknown.

Review: https://reviews.apache.org/r/68235/
{noformat}

> An attempt to remove or destroy container in composing containerizer leads to 
> segfault.
> ---
>
> Key: MESOS-9185
> URL: https://issues.apache.org/jira/browse/MESOS-9185
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Affects Versions: 1.7.0
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: mesosphere
> Fix For: 1.8.0
>
>
> `LAUNCH_NESTED_CONTAINER` and `LAUNCH_NESTED_CONTAINER_SESSION` lead to a 
> segfault in the agent when the parent container is unknown to the composing 
> containerizer. If the parent container cannot be found during an attempt to 
> launch a nested container via `ComposingContainerizerProcess::launch()`, the 
> composing containerizer returns an error without cleaning up the container. On 
> `launch()` failures, the agent calls `destroy()`, which accesses the 
> uninitialized `containerizer` field.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8345) Improve master responsiveness while serving state information.

2018-08-22 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-8345:
--

Assignee: Alexander Rukletsov

> Improve master responsiveness while serving state information.
> --
>
> Key: MESOS-8345
> URL: https://issues.apache.org/jira/browse/MESOS-8345
> Project: Mesos
>  Issue Type: Epic
>  Components: HTTP API, master
>Reporter: Benjamin Mahler
>Assignee: Alexander Rukletsov
>Priority: Major
>  Labels: mesosphere, performance
>
> Currently when state is requested from the master, the response is built 
> using the master actor. This means that when the master is building an 
> expensive state response, the master is locked and cannot process other 
> events. This in turn can lead to higher latency on further requests to state. 
> Previous performance improvements to JSON generation (MESOS-4235) alleviated 
> this issue, but for large cluster with a lot of clients this can still be a 
> problem.
> It's possible to serve state outside of the master actor by streaming the 
> state (re-using the existing streaming operator API) into one or more other 
> actors and serving from there.
> NOTE: I believe this approach will incur a small performance cost during 
> master failover, since the master has to perform an additional copy of state 
> that it fans out.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9177) Mesos master segfaults when responding to /state requests.

2018-08-22 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-9177:
--

 Summary: Mesos master segfaults when responding to /state requests.
 Key: MESOS-9177
 URL: https://issues.apache.org/jira/browse/MESOS-9177
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 1.7.0
Reporter: Alexander Rukletsov


{noformat}
 *** SIGSEGV (@0x8) received by PID 66991 (TID 0x7f36792b7700) from PID 8; 
stack trace: ***
 @ 0x7f367e7226d0 (unknown)
 @ 0x7f3681266913 
_ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_
 @ 0x7f3681266af0 
_ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZNK5mesos8internal6master19FullFrameworkWriterclEPNSA_12ObjectWriterEEUlPNSA_11ArrayWriterEE1_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
 @ 0x7f36812882d0 mesos::internal::master::FullFrameworkWriter::operator()()
 @ 0x7f36812889d0 
_ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIN5mesos8internal6master19FullFrameworkWriterEvEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
 @ 0x7f368121aef0 
_ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_ENKUlPNSA_12ObjectWriterEE_clESU_EUlPNSA_11ArrayWriterEE3_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
 @ 0x7f3681241be3 
_ZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNS4_5OwnedINS_15ObjectApprovers_clES8_SD_ENKUlPN4JSON12ObjectWriterEE_clESH_
 @ 0x7f3681242760 
_ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_EUlPNSA_12ObjectWriterEE_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
 @ 0x7f36810a41bb _ZNO4JSON5ProxycvSsEv
 @ 0x7f368215f60e process::http::OK::OK()
 @ 0x7f3681219061 
_ZN7process20AsyncExecutorProcess7executeIZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS_4http7RequestERKNS_5OwnedINS2_15ObjectApprovers_S8_SD_Li0EEENSt9result_ofIFT_T0_T1_EE4typeERKSI_SJ_SK_
 @ 0x7f36812212c0 
_ZZN7process8dispatchINS_4http8ResponseENS_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS1_7RequestERKNS_5OwnedINS4_15ObjectApprovers_S9_SE_SJ_RS9_RSE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSQ_FSN_T1_T2_T3_EOT4_OT5_OT6_ENKUlSt10unique_ptrINS_7PromiseIS2_EESt14default_deleteIS17_EEOSH_OS9_OSE_PNS_11ProcessBaseEE_clES1A_S1B_S1C_S1D_S1F_
 @ 0x7f36812215ac 
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchINS1_4http8ResponseENS1_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNSA_7RequestERKNS1_5OwnedINSD_15ObjectApprovers_SI_SN_SS_RSI_RSN_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSZ_FSW_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS1_7PromiseISB_EESt14default_deleteIS1G_EEOSQ_OSI_OSN_S3_E_IS1J_SQ_SI_SN_St12_PlaceholderILi1EEclEOS3_
 @ 0x7f36821f3541 process::ProcessBase::consume()
 @ 0x7f3682209fbc process::ProcessManager::resume()
 @ 0x7f368220fa76 
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
 @ 0x7f367eefc2b0 (unknown)
 @ 0x7f367e71ae25 start_thread
 @ 0x7f367e444bad __clone
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9176) Mesos does not work properly on modern Ubuntu distributions.

2018-08-22 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-9176:
--

 Summary: Mesos does not work properly on modern Ubuntu 
distributions.
 Key: MESOS-9176
 URL: https://issues.apache.org/jira/browse/MESOS-9176
 Project: Mesos
  Issue Type: Epic
Affects Versions: 1.7.0
 Environment: Ubuntu 17.10
Ubuntu 18.04
Reporter: Alexander Rukletsov


We have observed several issues in various components on modern Ubuntu 
releases, e.g., 17.10 and 18.04. Needless to say, we need to ensure Mesos 
compiles and runs fine on those distros.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9000) Operator API event stream can miss task status updates.

2018-08-21 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587188#comment-16587188
 ] 

Alexander Rukletsov edited comment on MESOS-9000 at 8/21/18 9:12 AM:
-

On the 1.8.0-dev:
{noformat}
commit 613741147123563f7b68e900c321e7f5db8236fe
Author: Benno Evers 
AuthorDate: Tue Aug 21 10:58:35 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 21 11:04:37 2018 +0200

Changed operator API to notify subscribers on every status change.

Prior to this change, the master would only send `TaskUpdated`
messages to subscribers when the latest known task state on the
agent changed.

This implied that schedulers could not reliably wait for the status
information corresponding to specific state updates, i.e.,
`TASK_RUNNING`, since there is no guarantee that subscribers get
notified during the time when this status update will be included in
the status field.

After this change, `TaskUpdated` messages are sent whenever the latest
acknowledged state of the task changes.

Review: https://reviews.apache.org/r/67575/
{noformat}


was (Author: alexr):
{noformat}
commit 613741147123563f7b68e900c321e7f5db8236fe
Author: Benno Evers 
AuthorDate: Tue Aug 21 10:58:35 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 21 11:04:37 2018 +0200

Changed operator API to notify subscribers on every status change.

Prior to this change, the master would only send `TaskUpdated`
messages to subscribers when the latest known task state on the
agent changed.

This implied that schedulers could not reliably wait for the status
information corresponding to specific state updates, i.e.,
`TASK_RUNNING`, since there is no guarantee that subscribers get
notified during the time when this status update will be included in
the status field.

After this change, `TaskUpdated` messages are sent whenever the latest
acknowledged state of the task changes.

Review: https://reviews.apache.org/r/67575/
{noformat}

> Operator API event stream can miss task status updates.
> ---
>
> Key: MESOS-9000
> URL: https://issues.apache.org/jira/browse/MESOS-9000
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>  Labels: mesosphere
> Fix For: 1.7.0
>
>
> As of now, the master only sends TaskUpdated messages to subscribers when the 
> latest known task state on the agent changed:
> {noformat}
>   // src/master/master.cpp
>   if (!protobuf::isTerminalState(task->state())) {
> if (status.state() != task->state()) {
>   sendSubscribersUpdate = true;
> }
> task->set_state(latestState.getOrElse(status.state()));
>   }
> {noformat}
> The latest state is set like this:
> {noformat}
> // src/messages/messages.proto
> message StatusUpdate {
>   [...]
>   // This corresponds to the latest state of the task according to the
>   // agent. Note that this state might be different than the state in
>   // 'status' because task status update manager queues updates. In
>   // other words, 'status' corresponds to the update at top of the
>   // queue and 'latest_state' corresponds to the update at bottom of
>   // the queue.
>   optional TaskState latest_state = 7;
> }
> {noformat}
> However, the `TaskStatus` message included in an `TaskUpdated` event is the 
> event at the bottom of the queue when the update was sent.
> So we can easily get in a situation where e.g. the first TaskUpdated has 
> .status.state == TASK_STARTING and .state == TASK_RUNNING, and the second 
> update with .status.state == TASK_RUNNING and .state == TASK_RUNNING would 
> not get delivered because the latest known state did not change.
> This implies that schedulers can not reliably wait for the status information 
> corresponding to specific task state, since there is no guarantee that 
> subscribers get notified during the time when this status update will be 
> included in the status field.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9000) Operator API event stream can miss task status updates.

2018-08-21 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587193#comment-16587193
 ] 

Alexander Rukletsov commented on MESOS-9000:


On the 1.7.x branch:
{noformat}
commit a2f826d5a641b8ae3e5742ffeab7166281e296f8
Author: Benno Evers 
AuthorDate: Tue Aug 21 10:58:35 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Aug 21 11:08:41 2018 +0200

Changed operator API to notify subscribers on every status change.

Prior to this change, the master would only send `TaskUpdated`
messages to subscribers when the latest known task state on the
agent changed.

This implied that schedulers could not reliably wait for the status
information corresponding to specific state updates, i.e.,
`TASK_RUNNING`, since there is no guarantee that subscribers get
notified during the time when this status update will be included in
the status field.

After this change, `TaskUpdated` messages are sent whenever the latest
acknowledged state of the task changes.

Review: https://reviews.apache.org/r/67575/
{noformat}

> Operator API event stream can miss task status updates.
> ---
>
> Key: MESOS-9000
> URL: https://issues.apache.org/jira/browse/MESOS-9000
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>  Labels: mesosphere
> Fix For: 1.7.0
>
>
> As of now, the master only sends TaskUpdated messages to subscribers when the 
> latest known task state on the agent changed:
> {noformat}
>   // src/master/master.cpp
>   if (!protobuf::isTerminalState(task->state())) {
> if (status.state() != task->state()) {
>   sendSubscribersUpdate = true;
> }
> task->set_state(latestState.getOrElse(status.state()));
>   }
> {noformat}
> The latest state is set like this:
> {noformat}
> // src/messages/messages.proto
> message StatusUpdate {
>   [...]
>   // This corresponds to the latest state of the task according to the
>   // agent. Note that this state might be different than the state in
>   // 'status' because task status update manager queues updates. In
>   // other words, 'status' corresponds to the update at top of the
>   // queue and 'latest_state' corresponds to the update at bottom of
>   // the queue.
>   optional TaskState latest_state = 7;
> }
> {noformat}
> However, the `TaskStatus` message included in an `TaskUpdated` event is the 
> event at the bottom of the queue when the update was sent.
> So we can easily get in a situation where e.g. the first TaskUpdated has 
> .status.state == TASK_STARTING and .state == TASK_RUNNING, and the second 
> update with .status.state == TASK_RUNNING and .state == TASK_RUNNING would 
> not get delivered because the latest known state did not change.
> This implies that schedulers can not reliably wait for the status information 
> corresponding to specific task state, since there is no guarantee that 
> subscribers get notified during the time when this status update will be 
> included in the status field.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


  1   2   3   4   5   6   7   8   9   10   >