[jira] [Updated] (MESOS-5886) FUTURE_DISPATCH may react on irrelevant dispatch.

2017-08-04 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-5886:
-
Shepherd: Michael Park

> FUTURE_DISPATCH may react on irrelevant dispatch.
> -
>
> Key: MESOS-5886
> URL: https://issues.apache.org/jira/browse/MESOS-5886
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.2, 1.2.1, 1.3.0, 1.4.0
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: mesosphere, tech-debt, tech-debt-test
>
> [{{FUTURE_DISPATCH}}|https://github.com/apache/mesos/blob/e8ebbe5fe4189ef7ab046da2276a6abee41deeb2/3rdparty/libprocess/include/process/gmock.hpp#L50]
>  uses 
> [{{DispatchMatcher}}|https://github.com/apache/mesos/blob/e8ebbe5fe4189ef7ab046da2276a6abee41deeb2/3rdparty/libprocess/include/process/gmock.hpp#L350]
>  to figure out whether a processed {{DispatchEvent}} is the same one the user is 
> waiting for. However, comparing the {{std::type_info}} of function pointers is 
> not enough: different class methods with the same signature will be matched. 
> Here is a test that demonstrates this:
> {noformat}
> class DispatchProcess : public Process<DispatchProcess>
> {
> public:
>   MOCK_METHOD0(func0, void());
>   MOCK_METHOD1(func1, bool(bool));
>   MOCK_METHOD1(func1_same_but_different, bool(bool));
>   MOCK_METHOD1(func2, Future<bool>(bool));
>   MOCK_METHOD1(func3, int(int));
>   MOCK_METHOD2(func4, Future<bool>(bool, int));
> };
> {noformat}
> {noformat}
> TEST(ProcessTest, DispatchMatch)
> {
>   DispatchProcess process;
>   PID<DispatchProcess> pid = spawn(process);
>   Future<Nothing> future = FUTURE_DISPATCH(
>       pid,
>       &DispatchProcess::func1_same_but_different);
>   EXPECT_CALL(process, func1(_))
>     .WillOnce(ReturnArg<0>());
>   dispatch(pid, &DispatchProcess::func1, true);
>   AWAIT_READY(future);
>   terminate(pid);
>   wait(pid);
> }
> {noformat}
> The test passes:
> {noformat}
> [ RUN  ] ProcessTest.DispatchMatch
> [   OK ] ProcessTest.DispatchMatch (1 ms)
> {noformat}
> This change was introduced in https://reviews.apache.org/r/28052/.
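The root cause is easy to reproduce outside of libprocess. The following standalone sketch (not Mesos code; the struct and method names are illustrative) shows that the {{std::type_info}} of two member function pointers compares equal whenever the methods share a signature, which is what {{DispatchMatcher}} relies on:
{code}
#include <iostream>
#include <typeinfo>

struct P
{
  bool func1(bool b) { return b; }
  bool func1_same_but_different(bool b) { return !b; }
};

int main()
{
  // Both expressions have the static type `bool (P::*)(bool)`, so their
  // type_info compares equal even though they refer to different methods.
  std::cout << std::boolalpha
            << (typeid(&P::func1) == typeid(&P::func1_same_but_different))
            << std::endl;  // prints "true"
  return 0;
}
{code}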



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7892) Filter results of `/state` on agent by role.

2017-08-15 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7892:
-
Sprint: Mesosphere Sprint 61

> Filter results of `/state` on agent by role.
> 
>
> Key: MESOS-7892
> URL: https://issues.apache.org/jira/browse/MESOS-7892
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>  Labels: mesosphere, security
>
> The results returned by {{/state}} contain information about reservations per 
> role which should be filtered for certain users, particularly in a 
> multi-tenancy scenario.
> The kind of leaked data includes specific role names and their specific 
> reservations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7892) Filter results of `/state` on agent by role.

2017-08-15 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7892:
-
Description: 
The results returned by {{/state}} include information about resource 
reservations per each role, which should be filtered for certain users, 
particularly in a multi-tenancy scenario.

The kind of leaked data includes specific role names and their specific 
reservations.

  was:
The results returned by {{/state}} contain information about reservations per 
role which should be filtered for certain users, particularly in a 
multi-tenancy scenario.

The kind of leaked data includes specific role names and their specific 
reservations.


> Filter results of `/state` on agent by role.
> 
>
> Key: MESOS-7892
> URL: https://issues.apache.org/jira/browse/MESOS-7892
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>  Labels: mesosphere, security
>
> The results returned by {{/state}} include information about resource 
> reservations for each role, which should be filtered for certain users, 
> particularly in a multi-tenancy scenario.
> The kind of leaked data includes specific role names and their specific 
> reservations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7892) Filter results of `/state` on agent by role.

2017-08-15 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7892:
-
Description: 
The results returned by {{/state}} include data about resource reservations per 
each role, which should be filtered for certain users, particularly in a 
multi-tenancy scenario.

The kind of leaked data includes specific role names and their specific 
reservations.

  was:
The results returned by {{/state}} include information about resource 
reservations per each role, which should be filtered for certain users, 
particularly in a multi-tenancy scenario.

The kind of leaked data includes specific role names and their specific 
reservations.


> Filter results of `/state` on agent by role.
> 
>
> Key: MESOS-7892
> URL: https://issues.apache.org/jira/browse/MESOS-7892
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>  Labels: mesosphere, security
>
> The results returned by {{/state}} include data about resource reservations 
> for each role, which should be filtered for certain users, particularly in a 
> multi-tenancy scenario.
> The kind of leaked data includes specific role names and their specific 
> reservations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-7791) subprocess' childMain using ABORT when encountering user errors

2017-08-16 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-7791:


Assignee: Andrei Budnik

> subprocess' childMain using ABORT when encountering user errors
> ---
>
> Key: MESOS-7791
> URL: https://issues.apache.org/jira/browse/MESOS-7791
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.4.0
>Reporter: Benjamin Bannier
>Assignee: Andrei Budnik
>  Labels: mesosphere, tech-debt
>
> In {{process/posix/subprocess.hpp}}'s {{childMain}} we exit with {{ABORT}} 
> when there is a user error, e.g.:
> {noformat}
> ABORT: 
> (/pkg/src/mesos/3rdparty/libprocess/include/process/posix/subprocess.hpp:195):
>  Failed to os::execvpe on path '/SOME/PATH': Argument list too long
> {noformat}
> Here we abort instead of simply {{_exit}}'ing and letting the user know that 
> we couldn't deal with the given arguments.
> Aborting can dump core, and since this abort happens before the {{execvpe}}, 
> the process image can be large (e.g., >300 MB), which could quickly fill up a 
> lot of disk space.
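For illustration, a minimal standalone sketch (not the Mesos code path; the exec'd path is a placeholder) of the suggested behavior: report the error and {{_exit}} from the forked child instead of calling {{abort}}, so no core dump of the potentially large process image is produced and the parent still observes a failure status:
{code}
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
  pid_t pid = fork();

  if (pid == 0) {
    // Child: on exec failure, avoid abort() because the child still shares
    // the parent's (possibly >300 MB) process image, which would be dumped.
    execlp("/SOME/PATH", "/SOME/PATH", (char*) nullptr);
    std::fprintf(stderr, "Failed to exec: %s\n", std::strerror(errno));
    _exit(EXIT_FAILURE);  // no core dump; parent sees a nonzero exit status
  }

  int status = 0;
  waitpid(pid, &status, 0);
  return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}
{code}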



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7892) Filter results of `/state` on agent by role.

2017-08-15 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-7892:


 Summary: Filter results of `/state` on agent by role.
 Key: MESOS-7892
 URL: https://issues.apache.org/jira/browse/MESOS-7892
 Project: Mesos
  Issue Type: Task
  Components: agent
Reporter: Andrei Budnik
Assignee: Andrei Budnik


The results returned by {{/state}} contain information about reservations per 
role which should be filtered for certain users, particularly in a 
multi-tenancy scenario.

The kind of leaked data includes specific role names and their specific 
reservations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7791) subprocess' childMain using ABORT when encountering user errors

2017-08-17 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7791:
-
Sprint: Mesosphere Sprint 61

> subprocess' childMain using ABORT when encountering user errors
> ---
>
> Key: MESOS-7791
> URL: https://issues.apache.org/jira/browse/MESOS-7791
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.4.0
>Reporter: Benjamin Bannier
>Assignee: Andrei Budnik
>  Labels: mesosphere, tech-debt
>
> In {{process/posix/subprocess.hpp}}'s {{childMain}} we exit with {{ABORT}} 
> when there is a user error, e.g.:
> {noformat}
> ABORT: 
> (/pkg/src/mesos/3rdparty/libprocess/include/process/posix/subprocess.hpp:195):
>  Failed to os::execvpe on path '/SOME/PATH': Argument list too long
> {noformat}
> Here we abort instead of simply {{_exit}}'ing and letting the user know that 
> we couldn't deal with the given arguments.
> Aborting can dump core, and since this abort happens before the {{execvpe}}, 
> the process image can be large (e.g., >300 MB), which could quickly fill up a 
> lot of disk space.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-6441) Display reservations in the agent page in the webui.

2017-07-18 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16059541#comment-16059541
 ] 

Andrei Budnik edited comment on MESOS-6441 at 7/18/17 10:44 AM:


https://reviews.apache.org/r/60867/
https://reviews.apache.org/r/60636/
https://reviews.apache.org/r/60907/
https://reviews.apache.org/r/60369/
https://reviews.apache.org/r/60915/
https://reviews.apache.org/r/60539/
https://reviews.apache.org/r/60370/


was (Author: abudnik):
[https://reviews.apache.org/r/60369/]
[https://reviews.apache.org/r/60370/]

> Display reservations in the agent page in the webui.
> 
>
> Key: MESOS-6441
> URL: https://issues.apache.org/jira/browse/MESOS-6441
> Project: Mesos
>  Issue Type: Task
>  Components: webui
>Reporter: Benjamin Mahler
>Assignee: Andrei Budnik
> Fix For: 1.4.0
>
>
> We currently do not display the reservations present on an agent in the 
> webui. It would be nice to see this information.
> It would also be nice to update the resource statistics tables to make the 
> distinction between unreserved and reserved resources. E.g.
> Reserved:
> Used, Allocated, Available and Total
> Unreserved:
> Used, Allocated, Available and Total



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-7775) Eliminate extra process abort in a subprocess watchdog

2017-07-10 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-7775:


Assignee: Andrei Budnik

> Eliminate extra process abort in a subprocess watchdog
> --
>
> Key: MESOS-7775
> URL: https://issues.apache.org/jira/browse/MESOS-7775
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>
> `abort()` is called in the `SUPERVISOR` hook when the child process exits with 
> an error code, `waitpid()` fails, or the parent process exits. None of these 
> cases should lead to abnormal program termination with core dumps.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7775) Eliminate extra process abort in a subprocess watchdog

2017-07-10 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-7775:


 Summary: Eliminate extra process abort in a subprocess watchdog
 Key: MESOS-7775
 URL: https://issues.apache.org/jira/browse/MESOS-7775
 Project: Mesos
  Issue Type: Bug
Reporter: Andrei Budnik


`abort()` is called in the `SUPERVISOR` hook when the child process exits with an 
error code, `waitpid()` fails, or the parent process exits. None of these cases 
should lead to abnormal program termination with core dumps.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-7871) Agent fails assertion during request to '/state'

2017-08-09 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-7871:


Assignee: Andrei Budnik  (was: Greg Mann)

https://reviews.apache.org/r/61524/

> Agent fails assertion during request to '/state'
> 
>
> Key: MESOS-7871
> URL: https://issues.apache.org/jira/browse/MESOS-7871
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Greg Mann
>Assignee: Andrei Budnik
>  Labels: mesosphere
>
> While processing requests to {{/state}}, the Mesos agent calls 
> {{Framework::allocatedResources()}}, which in turn calls 
> {{Slave::getExecutorInfo()}} on executors associated with the framework's 
> pending tasks.
> In the case of tasks launched as part of task groups, this leads to the 
> failure of the assertion 
> [here|https://github.com/apache/mesos/blob/a31dd52ab71d2a529b55cd9111ec54acf7550ded/src/slave/slave.cpp#L4983-L4985].
>  This means that the check will fail if the agent processes a request to 
> {{/state}} at a time when it has pending tasks launched as part of a task 
> group.
> This assertion should be removed since this helper function is now used with 
> task groups.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7791) subprocess' childMain using ABORT when encountering user errors

2017-08-19 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7791:
-
Story Points: 5

> subprocess' childMain using ABORT when encountering user errors
> ---
>
> Key: MESOS-7791
> URL: https://issues.apache.org/jira/browse/MESOS-7791
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.4.0
>Reporter: Benjamin Bannier
>Assignee: Andrei Budnik
>  Labels: mesosphere, tech-debt
>
> In {{process/posix/subprocess.hpp}}'s {{childMain}} we exit with {{ABORT}} 
> when there is a user error, e.g.:
> {noformat}
> ABORT: 
> (/pkg/src/mesos/3rdparty/libprocess/include/process/posix/subprocess.hpp:195):
>  Failed to os::execvpe on path '/SOME/PATH': Argument list too long
> {noformat}
> Here we abort instead of simply {{_exit}}'ing and letting the user know that 
> we couldn't deal with the given arguments.
> Aborting can dump core, and since this abort happens before the {{execvpe}}, 
> the process image can be large (e.g., >300 MB), which could quickly fill up a 
> lot of disk space.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7160) Parsing of perf version segfaults

2017-06-26 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16063108#comment-16063108
 ] 

Andrei Budnik commented on MESOS-7160:
--

It's very unusual for a parent to fail to wait for a child. Sure, {{waitpid}} can 
be called before or after the child exits, but AFAIK this shouldn't be a race 
condition, since the kernel keeps the {{task_struct}} until the parent process 
invokes {{wait*()}}.

In addition, if the path to an executable is invalid, then {{execv}} will fail, 
causing {{abort()}} to be invoked twice:
[https://github.com/apache/mesos/blob/18695ae8d5cfc209072950e887495a42dd83a1d9/3rdparty/libprocess/include/process/posix/subprocess.hpp#L195]
[https://github.com/apache/mesos/blob/18695ae8d5cfc209072950e887495a42dd83a1d9/3rdparty/libprocess/src/subprocess.cpp#L181]

> Parsing of perf version segfaults
> -
>
> Key: MESOS-7160
> URL: https://issues.apache.org/jira/browse/MESOS-7160
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Bannier
>
> Parsing the perf version [fails with a segfault in ASF 
> CI|https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/3294/],
> {noformat}
> E0222 20:54:03.033464   805 perf.cpp:237] Failed to get perf version: Failed 
> to execute perf: terminated with signal Aborted (core dumped)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7160) Parsing of perf version segfaults

2017-06-27 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7160:
-
Sprint: Mesosphere Sprint 58

> Parsing of perf version segfaults
> -
>
> Key: MESOS-7160
> URL: https://issues.apache.org/jira/browse/MESOS-7160
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Bannier
>Assignee: Andrei Budnik
>
> Parsing the perf version [fails with a segfault in ASF 
> CI|https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/3294/],
> {noformat}
> E0222 20:54:03.033464   805 perf.cpp:237] Failed to get perf version: Failed 
> to execute perf: terminated with signal Aborted (core dumped)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-6961) Executors don't use glog for logging.

2017-05-26 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-6961:


Assignee: Andrei Budnik

> Executors don't use glog for logging.
> -
>
> Key: MESOS-6961
> URL: https://issues.apache.org/jira/browse/MESOS-6961
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: log, mesosphere, newbie++
>
> Built-in Mesos executors use {{cout}}/{{cerr}} for logging. This is not only 
> inconsistent with the rest of the codebase, it also complicates debugging, 
> since, e.g., a stack trace is not printed on an abort. Having timestamps would 
> also be a huge plus.
> Consider migrating logging in all built-in executors to glog.
> There have been reported issues related to glog internal state races when a 
> process that has glog initialized {{fork-exec}}s another process that also 
> initializes glog. We should investigate how this issue is related to this 
> ticket, cc [~tillt], [~vinodkone], [~bmahler].
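For reference, a minimal glog usage sketch (standard glog API, not actual executor code) showing what a migrated {{cout}}-style message would look like; initialization gives timestamped, leveled output:
{code}
#include <glog/logging.h>

int main(int argc, char** argv)
{
  // Without this call glog warns: "Logging before InitGoogleLogging() ..."
  google::InitGoogleLogging(argv[0]);
  FLAGS_logtostderr = true;

  LOG(INFO) << "Executor started";                     // timestamped, leveled
  LOG_IF(WARNING, argc > 1) << "Ignoring extra arguments";

  return 0;
}
{code}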



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7586) Make use of cout/cerr and glog consistent.

2017-05-30 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-7586:


 Summary: Make use of cout/cerr and glog consistent.
 Key: MESOS-7586
 URL: https://issues.apache.org/jira/browse/MESOS-7586
 Project: Mesos
  Issue Type: Bug
Reporter: Andrei Budnik
Priority: Minor


Some parts of Mesos use glog before glog has been initialized. This leads to 
messages like:
“WARNING: Logging before InitGoogleLogging() is written to STDERR”
Also, messages logged via glog before logging is initialized might not end up in 
the log dir.

The solution might be:
cout/cerr should be used before logging initialization.
glog should be used after logging initialization.
 
Usually, a main function follows a pattern like this:
1. load = flags.load(argc, argv) // Load flags from command line.
2. Check if flags are correct, otherwise print error message to cerr and then 
exit.
3. Check if user passed --help flag to print help message to cout and then exit.
4. Parsing and setup of environment variables. If this fails, EXIT macro is 
used to print error message via glog.
5. process::initialize()
6. logging::initialize()
7. ...
 
Steps 2 and 3 should use cout/cerr to avoid the extra information generated by 
glog, like the current time, date, and log level.
It is safe to move step 6 between steps 3 and 4, because logging::initialize() 
doesn’t depend on process::initialize().
Some parts of Mesos don’t call logging::initialize() at all; this should also be 
fixed.
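A self-contained sketch of the proposed ordering (the helpers are hypothetical stand-ins, not the real stout/libprocess APIs; only the ordering of the steps matters here):
{code}
#include <cstdlib>
#include <iostream>
#include <string>

// Hypothetical stand-ins for flag loading and the two initializations.
struct Flags { bool ok = true; bool help = false; std::string error; };
Flags loadFlags(int argc, char** argv) { (void) argc; (void) argv; return Flags(); }
void initializeLogging(const char* argv0) { (void) argv0; /* glog setup */ }
void initializeLibprocess() { /* process::initialize() equivalent */ }

int main(int argc, char** argv)
{
  Flags flags = loadFlags(argc, argv);   // 1. load flags from the command line

  if (!flags.ok) {                       // 2. invalid flags: plain cerr, no glog
    std::cerr << flags.error << std::endl;
    return EXIT_FAILURE;
  }

  if (flags.help) {                      // 3. --help: plain cout, no glog
    std::cout << "usage: ..." << std::endl;
    return EXIT_SUCCESS;
  }

  initializeLogging(argv[0]);            // 6. moved up: glog is safe from here on

  // 4. environment setup; failures can now be reported via glog.
  initializeLibprocess();                // 5. libprocess initialization
  // 7. ...

  return EXIT_SUCCESS;
}
{code}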



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-5886) FUTURE_DISPATCH may react on irrelevant dispatch.

2017-06-09 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-5886:


Assignee: Andrei Budnik

> FUTURE_DISPATCH may react on irrelevant dispatch.
> -
>
> Key: MESOS-5886
> URL: https://issues.apache.org/jira/browse/MESOS-5886
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: mesosphere, tech-debt, tech-debt-test
>
> [{{FUTURE_DISPATCH}}|https://github.com/apache/mesos/blob/e8ebbe5fe4189ef7ab046da2276a6abee41deeb2/3rdparty/libprocess/include/process/gmock.hpp#L50]
>  uses 
> [{{DispatchMatcher}}|https://github.com/apache/mesos/blob/e8ebbe5fe4189ef7ab046da2276a6abee41deeb2/3rdparty/libprocess/include/process/gmock.hpp#L350]
>  to figure out whether a processed {{DispatchEvent}} is the same one the user is 
> waiting for. However, comparing the {{std::type_info}} of function pointers is 
> not enough: different class methods with the same signature will be matched. 
> Here is a test that demonstrates this:
> {noformat}
> class DispatchProcess : public Process<DispatchProcess>
> {
> public:
>   MOCK_METHOD0(func0, void());
>   MOCK_METHOD1(func1, bool(bool));
>   MOCK_METHOD1(func1_same_but_different, bool(bool));
>   MOCK_METHOD1(func2, Future<bool>(bool));
>   MOCK_METHOD1(func3, int(int));
>   MOCK_METHOD2(func4, Future<bool>(bool, int));
> };
> {noformat}
> {noformat}
> TEST(ProcessTest, DispatchMatch)
> {
>   DispatchProcess process;
>   PID<DispatchProcess> pid = spawn(process);
>   Future<Nothing> future = FUTURE_DISPATCH(
>       pid,
>       &DispatchProcess::func1_same_but_different);
>   EXPECT_CALL(process, func1(_))
>     .WillOnce(ReturnArg<0>());
>   dispatch(pid, &DispatchProcess::func1, true);
>   AWAIT_READY(future);
>   terminate(pid);
>   wait(pid);
> }
> {noformat}
> The test passes:
> {noformat}
> [ RUN  ] ProcessTest.DispatchMatch
> [   OK ] ProcessTest.DispatchMatch (1 ms)
> {noformat}
> This change was introduced in https://reviews.apache.org/r/28052/.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-6441) Display reservations in the agent page in the webui.

2017-06-16 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051609#comment-16051609
 ] 

Andrei Budnik commented on MESOS-6441:
--

Here is a model of the table. It will be on the agent page, just above the 
"Framework" and "Completed Frameworks" sections:
https://docs.google.com/spreadsheets/d/1o2DapAFKJTMN0IKcjbtYatRWh273-WsbN951vQ3tTqs

Please feel free to make any comments and suggestions on it.

> Display reservations in the agent page in the webui.
> 
>
> Key: MESOS-6441
> URL: https://issues.apache.org/jira/browse/MESOS-6441
> Project: Mesos
>  Issue Type: Task
>  Components: webui
>Reporter: Benjamin Mahler
>Assignee: Andrei Budnik
>
> We currently do not display the reservations present on an agent in the 
> webui. It would be nice to see this information.
> It would also be nice to update the resource statistics tables to make the 
> distinction between unreserved and reserved resources. E.g.
> Reserved:
> Used, Allocated, Available and Total
> Unreserved:
> Used, Allocated, Available and Total



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-6441) Display reservations in the agent page in the webui.

2017-06-16 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051635#comment-16051635
 ] 

Andrei Budnik commented on MESOS-6441:
--

In the model of the table you can see "Allocated / Total" for each resource and 
each role.
The "Total" amount of resources for each role on a specific agent is already 
available via the agent's /state endpoint (see "reserved_resources" & 
"unreserved_resources").
The "Allocated" amount of resources for each role on a specific agent is needed, 
but not provided right now.
I'm trying to find the best way to get these allocated resources on the Mesos 
agent. What I'm pretty sure of is that we need to modify the agent's code to be 
able to show "Allocated".

> Display reservations in the agent page in the webui.
> 
>
> Key: MESOS-6441
> URL: https://issues.apache.org/jira/browse/MESOS-6441
> Project: Mesos
>  Issue Type: Task
>  Components: webui
>Reporter: Benjamin Mahler
>Assignee: Andrei Budnik
>
> We currently do not display the reservations present on an agent in the 
> webui. It would be nice to see this information.
> It would also be nice to update the resource statistics tables to make the 
> distinction between unreserved and reserved resources. E.g.
> Reserved:
> Used, Allocated, Available and Total
> Unreserved:
> Used, Allocated, Available and Total



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-6441) Display reservations in the agent page in the webui.

2017-06-16 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051971#comment-16051971
 ] 

Andrei Budnik commented on MESOS-6441:
--

Here is a small design doc on the modification of the agent's endpoint:
https://docs.google.com/document/d/1TeyCLcwLKZ5F2NP7rADIEK34gLM2iq4wdxt2anI6saY

> Display reservations in the agent page in the webui.
> 
>
> Key: MESOS-6441
> URL: https://issues.apache.org/jira/browse/MESOS-6441
> Project: Mesos
>  Issue Type: Task
>  Components: webui
>Reporter: Benjamin Mahler
>Assignee: Andrei Budnik
>
> We currently do not display the reservations present on an agent in the 
> webui. It would be nice to see this information.
> It would also be nice to update the resource statistics tables to make the 
> distinction between unreserved and reserved resources. E.g.
> Reserved:
> Used, Allocated, Available and Total
> Unreserved:
> Used, Allocated, Available and Total



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-4812) Mesos fails to escape command health checks

2017-09-18 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-4812:


Assignee: Andrei Budnik  (was: haosdent)

Reworked Haosdent's patch:
https://reviews.apache.org/r/62381/

> Mesos fails to escape command health checks
> ---
>
> Key: MESOS-4812
> URL: https://issues.apache.org/jira/browse/MESOS-4812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Lukas Loesche
>Assignee: Andrei Budnik
>  Labels: health-check, mesosphere, tech-debt
> Attachments: health_task.gif
>
>
> As described in https://github.com/mesosphere/marathon/issues/
> I would like to run a command health check
> {noformat}
> /bin/bash -c " {noformat}
> The health check fails because Mesos, while running the command inside the 
> double quotes of an sh -c "" invocation, doesn't escape the double quotes in 
> the command.
> If I escape the double quotes myself, the command health check succeeds. But 
> this would mean that the user needs intimate knowledge of how Mesos executes 
> their commands, which can't be right.
> I was told this is not a Marathon but a Mesos issue, so I am opening this JIRA. 
> I don't know if this only affects the command health check.
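To illustrate the quoting problem (a standalone sketch, not the Mesos health-check code; the command string is made up): naively wrapping a user command in double quotes breaks as soon as the command itself contains double quotes, which is why unescaped commands fail when run through sh -c "...":
{code}
#include <iostream>
#include <string>

int main()
{
  // A hypothetical user command that itself contains double quotes.
  const std::string command =
    "curl -H \"Content-Type: application/json\" http://localhost/health";

  // Naive composition: the inner quotes terminate the outer quoting early,
  // so the shell sees a mangled command line unless the quotes are escaped.
  const std::string naive = "/bin/sh -c \"" + command + "\"";

  std::cout << naive << std::endl;
  // /bin/sh -c "curl -H "Content-Type: application/json" http://localhost/health"
  return 0;
}
{code}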



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7500) Command checks via agent lead to flaky tests.

2017-09-21 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174785#comment-16174785
 ] 

Andrei Budnik commented on MESOS-7500:
--

Another example from a failed run, including debug output 
(https://reviews.apache.org/r/59107):
https://pastebin.com/iKA1WaZB

> Command checks via agent lead to flaky tests.
> -
>
> Key: MESOS-7500
> URL: https://issues.apache.org/jira/browse/MESOS-7500
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Gastón Kleiman
>  Labels: check, flaky-test, health-check, mesosphere
>
> Tests that rely on command checks via agent are flaky on Apache CI. Here is 
> an example from one of the failed runs: https://pastebin.com/g2mPgYzu



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-4812) Mesos fails to escape command health checks

2017-10-04 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16191117#comment-16191117
 ] 

Andrei Budnik commented on MESOS-4812:
--

I have closed [/r/62381|https://reviews.apache.org/r/62381/], for details see 
comment in discard reason.

> Mesos fails to escape command health checks
> ---
>
> Key: MESOS-4812
> URL: https://issues.apache.org/jira/browse/MESOS-4812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Lukas Loesche
>Assignee: Andrei Budnik
>  Labels: health-check, mesosphere, tech-debt
> Attachments: health_task.gif
>
>
> As described in https://github.com/mesosphere/marathon/issues/
> I would like to run a command health check
> {noformat}
> /bin/bash -c " {noformat}
> The health check fails because Mesos, while running the command inside the 
> double quotes of an sh -c "" invocation, doesn't escape the double quotes in 
> the command.
> If I escape the double quotes myself, the command health check succeeds. But 
> this would mean that the user needs intimate knowledge of how Mesos executes 
> their commands, which can't be right.
> I was told this is not a Marathon but a Mesos issue, so I am opening this JIRA. 
> I don't know if this only affects the command health check.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7504) Parent's mount namespace cannot be determined when launching a nested container.

2017-10-05 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192797#comment-16192797
 ] 

Andrei Budnik commented on MESOS-7504:
--

Code modifications to reproduce the test failure:
1. Add {{::sleep(1);}} to 
https://github.com/apache/mesos/blob/657a930e173aaee7a168734bf59e8eb022d6668f/src/tests/containerizer/nested_mesos_containerizer_tests.cpp#L1144
2. Add {{launchInfo.add_pre_exec_commands()->set_value("sleep 2");}} to 
https://github.com/apache/mesos/blob/657a930e173aaee7a168734bf59e8eb022d6668f/src/slave/containerizer/mesos/isolators/namespaces/pid.cpp#L135
3. Add {{::sleep(3);}} to 
https://github.com/apache/mesos/blob/657a930e173aaee7a168734bf59e8eb022d6668f/src/slave/containerizer/mesos/utils.cpp#L73

> Parent's mount namespace cannot be determined when launching a nested 
> container.
> 
>
> Key: MESOS-7504
> URL: https://issues.apache.org/jira/browse/MESOS-7504
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.3.0
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
>
> I've observed this failure twice in different Linux environments. Here is an 
> example of such failure:
> {noformat}
> [ RUN  ] 
> NestedMesosContainerizerTest.ROOT_CGROUPS_DestroyDebugContainerOnRecover
> I0509 21:53:25.471657 17167 containerizer.cpp:221] Using isolation: 
> cgroups/cpu,filesystem/linux,namespaces/pid,network/cni,volume/image
> I0509 21:53:25.475124 17167 linux_launcher.cpp:150] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> I0509 21:53:25.475407 17167 provisioner.cpp:249] Using default backend 
> 'overlay'
> I0509 21:53:25.481232 17186 containerizer.cpp:608] Recovering containerizer
> I0509 21:53:25.482295 17186 provisioner.cpp:410] Provisioner recovery complete
> I0509 21:53:25.482587 17187 containerizer.cpp:1001] Starting container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d for executor 'executor' of framework 
> I0509 21:53:25.482918 17189 cgroups.cpp:410] Creating cgroup at 
> '/sys/fs/cgroup/cpu,cpuacct/mesos_test_d989f526-efe0-4553-bf79-936ad66c3753/21bc372c-0f2c-49f5-b8ab-8d32c232b95d'
>  for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484103 17190 cpu.cpp:101] Updated 'cpu.shares' to 1024 (cpus 
> 1) for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484808 17186 containerizer.cpp:1524] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> --launch_info="{"clone_namespaces":[131072,536870912],"command":{"shell":true,"value":"sleep
>  
> 1000"},"environment":{"variables":[{"name":"MESOS_SANDBOX","type":"VALUE","value":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}]},"pre_exec_commands":[{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/home\/ubuntu\/workspace\/mesos\/Mesos_CI-build\/FLAG\/SSL\/label\/mesos-ec2-ubuntu-16.04\/mesos\/build\/src\/mesos-containerizer"},{"shell":true,"value":"mount
>  -n -t proc proc \/proc -o 
> nosuid,noexec,nodev"}],"working_directory":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}"
>  --pipe_read="29" --pipe_write="32" 
> --runtime_directory="/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_sKhtj7/containers/21bc372c-0f2c-49f5-b8ab-8d32c232b95d"
>  --unshare_namespace_mnt="false"'
> I0509 21:53:25.484978 17189 linux_launcher.cpp:429] Launching container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWPID
> I0509 21:53:25.513890 17186 containerizer.cpp:1623] Checkpointing container's 
> forked pid 1873 to 
> '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_Rdjw6M/meta/slaves/frameworks/executors/executor/runs/21bc372c-0f2c-49f5-b8ab-8d32c232b95d/pids/forked.pid'
> I0509 21:53:25.515878 17190 fetcher.cpp:353] Starting to fetch URIs for 
> container: 21bc372c-0f2c-49f5-b8ab-8d32c232b95d, directory: 
> /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr
> I0509 21:53:25.517715 17193 containerizer.cpp:1791] Starting nested container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.518569 17193 switchboard.cpp:545] Launching 
> 'mesos-io-switchboard' with flags '--heartbeat_interval="30secs" 
> --help="false" 
> --socket_address="/tmp/mesos-io-switchboard-ca463cf2-70ba-4121-a5c6-1a170ae40c1b"
>  --stderr_from_fd="36" --stderr_to_fd="2" --stdin_to_fd="32" 
> --stdout_from_fd="33" --stdout_to_fd="1" --tty="false" 
> --wait_for_connection="true"' for container 
> 

[jira] [Commented] (MESOS-7504) Parent's mount namespace cannot be determined when launching a nested container.

2017-10-05 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193100#comment-16193100
 ] 

Andrei Budnik commented on MESOS-7504:
--

The containerizer launcher spawns 
[pre-exec-hooks|https://github.com/apache/mesos/blob/46db7e4f27831d20244a57b22a70312f2a574395/src/slave/containerizer/mesos/launch.cpp#L384]
 before launching the given command (e.g. `sleep 1000`).
For the 
{{NestedMesosContainerizerTest.ROOT_CGROUPS_DestroyDebugContainerOnRecover}} 
test, we need to enter the {{"cgroups/cpu,filesystem/linux,namespaces/pid"}} 
namespaces, where the `filesystem/linux` and `namespaces/pid` isolators add 2 
pre-exec-hooks; from the logs:
{code}
Executing pre-exec command 
'{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/home\/abudnik\/mesos\/build\/src\/mesos-containerizer"}'
Executing pre-exec command '{"shell":true,"value":"mount -n -t proc proc \/proc 
-o nosuid,noexec,nodev"}'
{code}
After launching the parent container, we try to launch a nested container. The agent 
[calls|https://github.com/apache/mesos/blob/46db7e4f27831d20244a57b22a70312f2a574395/src/slave/containerizer/mesos/containerizer.cpp#L1758]
 the 
[getMountNamespaceTarget|https://github.com/apache/mesos/blob/46db7e4f27831d20244a57b22a70312f2a574395/src/slave/containerizer/mesos/utils.cpp#L59]
 function, which returns the "Cannot get target mount namespace from process" 
error in this test.
If you take a look at it, you'll find that there is a small delay between 
enumerating all child processes (which might still contain running 
pre-exec-hook processes) and calling {{ns::getns}} for each child process. 
During this delay any of the pre-exec-hook processes might exit, causing this 
error message.
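A standalone sketch (not Mesos code) of why the race surfaces as this error: once a pid obtained from the child enumeration has exited, reading its mount namespace from /proc fails, and the caller has no choice but to report an error like "Cannot get target mount namespace from process":
{code}
#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>

#include <unistd.h>

int main(int argc, char** argv)
{
  if (argc != 2) {
    std::cerr << "usage: " << argv[0] << " <pid>" << std::endl;
    return 1;
  }

  const std::string path = std::string("/proc/") + argv[1] + "/ns/mnt";

  char buffer[64] = {0};
  if (::readlink(path.c_str(), buffer, sizeof(buffer) - 1) < 0) {
    // A short-lived process (e.g. a pre-exec hook) may already be gone by the
    // time we get here, even though it was listed a moment earlier.
    std::cerr << "Cannot read " << path << ": " << std::strerror(errno)
              << std::endl;
    return 1;
  }

  std::cout << buffer << std::endl;  // e.g. "mnt:[4026531840]"
  return 0;
}
{code}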

> Parent's mount namespace cannot be determined when launching a nested 
> container.
> 
>
> Key: MESOS-7504
> URL: https://issues.apache.org/jira/browse/MESOS-7504
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.3.0
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
>
> I've observed this failure twice in different Linux environments. Here is an 
> example of such failure:
> {noformat}
> [ RUN  ] 
> NestedMesosContainerizerTest.ROOT_CGROUPS_DestroyDebugContainerOnRecover
> I0509 21:53:25.471657 17167 containerizer.cpp:221] Using isolation: 
> cgroups/cpu,filesystem/linux,namespaces/pid,network/cni,volume/image
> I0509 21:53:25.475124 17167 linux_launcher.cpp:150] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> I0509 21:53:25.475407 17167 provisioner.cpp:249] Using default backend 
> 'overlay'
> I0509 21:53:25.481232 17186 containerizer.cpp:608] Recovering containerizer
> I0509 21:53:25.482295 17186 provisioner.cpp:410] Provisioner recovery complete
> I0509 21:53:25.482587 17187 containerizer.cpp:1001] Starting container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d for executor 'executor' of framework 
> I0509 21:53:25.482918 17189 cgroups.cpp:410] Creating cgroup at 
> '/sys/fs/cgroup/cpu,cpuacct/mesos_test_d989f526-efe0-4553-bf79-936ad66c3753/21bc372c-0f2c-49f5-b8ab-8d32c232b95d'
>  for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484103 17190 cpu.cpp:101] Updated 'cpu.shares' to 1024 (cpus 
> 1) for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484808 17186 containerizer.cpp:1524] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> --launch_info="{"clone_namespaces":[131072,536870912],"command":{"shell":true,"value":"sleep
>  
> 1000"},"environment":{"variables":[{"name":"MESOS_SANDBOX","type":"VALUE","value":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}]},"pre_exec_commands":[{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/home\/ubuntu\/workspace\/mesos\/Mesos_CI-build\/FLAG\/SSL\/label\/mesos-ec2-ubuntu-16.04\/mesos\/build\/src\/mesos-containerizer"},{"shell":true,"value":"mount
>  -n -t proc proc \/proc -o 
> nosuid,noexec,nodev"}],"working_directory":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}"
>  --pipe_read="29" --pipe_write="32" 
> --runtime_directory="/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_sKhtj7/containers/21bc372c-0f2c-49f5-b8ab-8d32c232b95d"
>  --unshare_namespace_mnt="false"'
> I0509 21:53:25.484978 17189 linux_launcher.cpp:429] Launching container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWPID
> I0509 21:53:25.513890 17186 containerizer.cpp:1623] 

[jira] [Commented] (MESOS-7504) Parent's mount namespace cannot be determined when launching a nested container.

2017-10-17 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207669#comment-16207669
 ] 

Andrei Budnik commented on MESOS-7504:
--

https://reviews.apache.org/r/63074/
https://reviews.apache.org/r/63035/

> Parent's mount namespace cannot be determined when launching a nested 
> container.
> 
>
> Key: MESOS-7504
> URL: https://issues.apache.org/jira/browse/MESOS-7504
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.3.0
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
>
> I've observed this failure twice in different Linux environments. Here is an 
> example of such failure:
> {noformat}
> [ RUN  ] 
> NestedMesosContainerizerTest.ROOT_CGROUPS_DestroyDebugContainerOnRecover
> I0509 21:53:25.471657 17167 containerizer.cpp:221] Using isolation: 
> cgroups/cpu,filesystem/linux,namespaces/pid,network/cni,volume/image
> I0509 21:53:25.475124 17167 linux_launcher.cpp:150] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> I0509 21:53:25.475407 17167 provisioner.cpp:249] Using default backend 
> 'overlay'
> I0509 21:53:25.481232 17186 containerizer.cpp:608] Recovering containerizer
> I0509 21:53:25.482295 17186 provisioner.cpp:410] Provisioner recovery complete
> I0509 21:53:25.482587 17187 containerizer.cpp:1001] Starting container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d for executor 'executor' of framework 
> I0509 21:53:25.482918 17189 cgroups.cpp:410] Creating cgroup at 
> '/sys/fs/cgroup/cpu,cpuacct/mesos_test_d989f526-efe0-4553-bf79-936ad66c3753/21bc372c-0f2c-49f5-b8ab-8d32c232b95d'
>  for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484103 17190 cpu.cpp:101] Updated 'cpu.shares' to 1024 (cpus 
> 1) for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484808 17186 containerizer.cpp:1524] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> --launch_info="{"clone_namespaces":[131072,536870912],"command":{"shell":true,"value":"sleep
>  
> 1000"},"environment":{"variables":[{"name":"MESOS_SANDBOX","type":"VALUE","value":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}]},"pre_exec_commands":[{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/home\/ubuntu\/workspace\/mesos\/Mesos_CI-build\/FLAG\/SSL\/label\/mesos-ec2-ubuntu-16.04\/mesos\/build\/src\/mesos-containerizer"},{"shell":true,"value":"mount
>  -n -t proc proc \/proc -o 
> nosuid,noexec,nodev"}],"working_directory":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}"
>  --pipe_read="29" --pipe_write="32" 
> --runtime_directory="/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_sKhtj7/containers/21bc372c-0f2c-49f5-b8ab-8d32c232b95d"
>  --unshare_namespace_mnt="false"'
> I0509 21:53:25.484978 17189 linux_launcher.cpp:429] Launching container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWPID
> I0509 21:53:25.513890 17186 containerizer.cpp:1623] Checkpointing container's 
> forked pid 1873 to 
> '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_Rdjw6M/meta/slaves/frameworks/executors/executor/runs/21bc372c-0f2c-49f5-b8ab-8d32c232b95d/pids/forked.pid'
> I0509 21:53:25.515878 17190 fetcher.cpp:353] Starting to fetch URIs for 
> container: 21bc372c-0f2c-49f5-b8ab-8d32c232b95d, directory: 
> /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr
> I0509 21:53:25.517715 17193 containerizer.cpp:1791] Starting nested container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.518569 17193 switchboard.cpp:545] Launching 
> 'mesos-io-switchboard' with flags '--heartbeat_interval="30secs" 
> --help="false" 
> --socket_address="/tmp/mesos-io-switchboard-ca463cf2-70ba-4121-a5c6-1a170ae40c1b"
>  --stderr_from_fd="36" --stderr_to_fd="2" --stdin_to_fd="32" 
> --stdout_from_fd="33" --stdout_to_fd="1" --tty="false" 
> --wait_for_connection="true"' for container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.521229 17193 switchboard.cpp:575] Created I/O switchboard 
> server (pid: 1881) listening on socket file 
> '/tmp/mesos-io-switchboard-ca463cf2-70ba-4121-a5c6-1a170ae40c1b' for 
> container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.522195 17191 containerizer.cpp:1524] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> 

[jira] [Commented] (MESOS-7504) Parent's mount namespace cannot be determined when launching a nested container.

2017-10-17 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207751#comment-16207751
 ] 

Andrei Budnik commented on MESOS-7504:
--

List of failing tests:
{{NestedMesosContainerizerTest.ROOT_CGROUPS_DestroyDebugContainerOnRecover}}
{{ROOT_CGROUPS_DebugNestedContainerInheritsEnvironment}}

> Parent's mount namespace cannot be determined when launching a nested 
> container.
> 
>
> Key: MESOS-7504
> URL: https://issues.apache.org/jira/browse/MESOS-7504
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.3.0
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
>
> I've observed this failure twice in different Linux environments. Here is an 
> example of such failure:
> {noformat}
> [ RUN  ] 
> NestedMesosContainerizerTest.ROOT_CGROUPS_DestroyDebugContainerOnRecover
> I0509 21:53:25.471657 17167 containerizer.cpp:221] Using isolation: 
> cgroups/cpu,filesystem/linux,namespaces/pid,network/cni,volume/image
> I0509 21:53:25.475124 17167 linux_launcher.cpp:150] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> I0509 21:53:25.475407 17167 provisioner.cpp:249] Using default backend 
> 'overlay'
> I0509 21:53:25.481232 17186 containerizer.cpp:608] Recovering containerizer
> I0509 21:53:25.482295 17186 provisioner.cpp:410] Provisioner recovery complete
> I0509 21:53:25.482587 17187 containerizer.cpp:1001] Starting container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d for executor 'executor' of framework 
> I0509 21:53:25.482918 17189 cgroups.cpp:410] Creating cgroup at 
> '/sys/fs/cgroup/cpu,cpuacct/mesos_test_d989f526-efe0-4553-bf79-936ad66c3753/21bc372c-0f2c-49f5-b8ab-8d32c232b95d'
>  for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484103 17190 cpu.cpp:101] Updated 'cpu.shares' to 1024 (cpus 
> 1) for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484808 17186 containerizer.cpp:1524] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> --launch_info="{"clone_namespaces":[131072,536870912],"command":{"shell":true,"value":"sleep
>  
> 1000"},"environment":{"variables":[{"name":"MESOS_SANDBOX","type":"VALUE","value":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}]},"pre_exec_commands":[{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/home\/ubuntu\/workspace\/mesos\/Mesos_CI-build\/FLAG\/SSL\/label\/mesos-ec2-ubuntu-16.04\/mesos\/build\/src\/mesos-containerizer"},{"shell":true,"value":"mount
>  -n -t proc proc \/proc -o 
> nosuid,noexec,nodev"}],"working_directory":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}"
>  --pipe_read="29" --pipe_write="32" 
> --runtime_directory="/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_sKhtj7/containers/21bc372c-0f2c-49f5-b8ab-8d32c232b95d"
>  --unshare_namespace_mnt="false"'
> I0509 21:53:25.484978 17189 linux_launcher.cpp:429] Launching container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWPID
> I0509 21:53:25.513890 17186 containerizer.cpp:1623] Checkpointing container's 
> forked pid 1873 to 
> '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_Rdjw6M/meta/slaves/frameworks/executors/executor/runs/21bc372c-0f2c-49f5-b8ab-8d32c232b95d/pids/forked.pid'
> I0509 21:53:25.515878 17190 fetcher.cpp:353] Starting to fetch URIs for 
> container: 21bc372c-0f2c-49f5-b8ab-8d32c232b95d, directory: 
> /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr
> I0509 21:53:25.517715 17193 containerizer.cpp:1791] Starting nested container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.518569 17193 switchboard.cpp:545] Launching 
> 'mesos-io-switchboard' with flags '--heartbeat_interval="30secs" 
> --help="false" 
> --socket_address="/tmp/mesos-io-switchboard-ca463cf2-70ba-4121-a5c6-1a170ae40c1b"
>  --stderr_from_fd="36" --stderr_to_fd="2" --stdin_to_fd="32" 
> --stdout_from_fd="33" --stdout_to_fd="1" --tty="false" 
> --wait_for_connection="true"' for container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.521229 17193 switchboard.cpp:575] Created I/O switchboard 
> server (pid: 1881) listening on socket file 
> '/tmp/mesos-io-switchboard-ca463cf2-70ba-4121-a5c6-1a170ae40c1b' for 
> container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.522195 17191 containerizer.cpp:1524] Launching 
> 

[jira] [Assigned] (MESOS-7500) Command checks via agent lead to flaky tests.

2017-09-25 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-7500:


Assignee: Andrei Budnik  (was: Gastón Kleiman)

> Command checks via agent lead to flaky tests.
> -
>
> Key: MESOS-7500
> URL: https://issues.apache.org/jira/browse/MESOS-7500
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: check, flaky-test, health-check, mesosphere
>
> Tests that rely on command checks via agent are flaky on Apache CI. Here is 
> an example from one of the failed runs: https://pastebin.com/g2mPgYzu



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7500) Command checks via agent lead to flaky tests.

2017-09-25 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7500:
-
Story Points: 8  (was: 5)

> Command checks via agent lead to flaky tests.
> -
>
> Key: MESOS-7500
> URL: https://issues.apache.org/jira/browse/MESOS-7500
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: check, flaky-test, health-check, mesosphere
>
> Tests that rely on command checks via agent are flaky on Apache CI. Here is 
> an example from one of the failed runs: https://pastebin.com/g2mPgYzu



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7500) Command checks via agent lead to flaky tests.

2017-09-25 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179000#comment-16179000
 ] 

Andrei Budnik commented on MESOS-7500:
--

The issue is caused by recompilation/relinking of an executable by the libtool 
wrapper script. E.g. when we launch `mesos-io-switchboard` for the first time, 
the executable might be missing, so the wrapper script starts to compile/link 
the corresponding executable. On slow machines compilation takes quite a while, 
hence these tests become flaky.

One possible solution is to pass 
[--disable-fast-install|http://mdcc.cx/pub/autobook/autobook-latest/html/autobook_85.html]
 in the $CONFIGURATION environment variable to the docker helper script.

> Command checks via agent lead to flaky tests.
> -
>
> Key: MESOS-7500
> URL: https://issues.apache.org/jira/browse/MESOS-7500
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Gastón Kleiman
>  Labels: check, flaky-test, health-check, mesosphere
>
> Tests that rely on command checks via agent are flaky on Apache CI. Here is 
> an example from one of the failed runs: https://pastebin.com/g2mPgYzu



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-7500) Command checks via agent lead to flaky tests.

2017-09-25 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179000#comment-16179000
 ] 

Andrei Budnik edited comment on MESOS-7500 at 9/25/17 2:18 PM:
---

The issue is caused by recompilation/relinking of an executable by the libtool 
wrapper script. E.g. when we launch `mesos-io-switchboard` for the first time, 
the executable might be missing, so the wrapper script starts to compile/link 
the corresponding executable. On slow machines compilation takes quite a while, 
hence these tests become flaky.

One possible solution is to pass [\-\-enable-fast-install=no 
(--disable-fast-install)|http://mdcc.cx/pub/autobook/autobook-latest/html/autobook_85.html]
 in the $CONFIGURATION environment variable to the docker helper script.


was (Author: abudnik):
The issue is caused by recompilation/relinking of an executable by the libtool 
wrapper script. E.g. when we launch `mesos-io-switchboard` for the first time, 
the executable might be missing, so the wrapper script starts to compile/link 
the corresponding executable. On slow machines compilation takes quite a while, 
hence these tests become flaky.

One possible solution is to pass 
[--disable-fast-install|http://mdcc.cx/pub/autobook/autobook-latest/html/autobook_85.html]
 in the $CONFIGURATION environment variable to the docker helper script.

> Command checks via agent lead to flaky tests.
> -
>
> Key: MESOS-7500
> URL: https://issues.apache.org/jira/browse/MESOS-7500
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: check, flaky-test, health-check, mesosphere
>
> Tests that rely on command checks via agent are flaky on Apache CI. Here is 
> an example from one of the failed runs: https://pastebin.com/g2mPgYzu



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8037) ns::clone should spawn process, which is a direct child

2017-09-28 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16184801#comment-16184801
 ] 

Andrei Budnik commented on MESOS-8037:
--

Health checks use their own procedure to enter namespaces, see 
https://github.com/apache/mesos/blob/7b79d8d4fb47aca05d28033f34a1f6b75dcfbe87/src/checks/checker_process.cpp#L103-L139

Health checks can't enter the PID namespace. Also, the user (client code) of 
health checks has to pass the list of namespaces in a specific order, because the 
order in which we enter namespaces matters. To solve these problems we could use 
{{ns::clone}}, but it returns the pid of a process which is not our child, so we 
can't get its exit code, which is needed for health checks.

Also, this feature could be used in the Mesos containerizer, e.g. for logging the 
status of an exited process: 
https://github.com/apache/mesos/blob/7b79d8d4fb47aca05d28033f34a1f6b75dcfbe87/src/slave/containerizer/mesos/linux_launcher.cpp#L480
  

> ns::clone should spawn process, which is a direct child
> ---
>
> Key: MESOS-8037
> URL: https://issues.apache.org/jira/browse/MESOS-8037
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Andrei Budnik
>
> `ns::clone` does a double fork in order to be able to enter the given PID 
> namespace and returns the grandchild's pid, which is not a direct child of the 
> parent process, hence the parent process cannot retrieve the status of the 
> exited grandchild process.
> As the second fork is implemented via `os::clone`, we can pass the 
> `CLONE_PARENT` flag. Also, we have to handle both the intermediate child 
> process and the grandchild process to avoid zombies.
> The motivation behind this improvement is that both `docker exec` and `LXC 
> attach` can enter a process' PID namespace while still controlling the child's 
> status code.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8037) ns::clone should spawn process, which is a direct child

2017-09-28 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8037:


 Summary: ns::clone should spawn process, which is a direct child
 Key: MESOS-8037
 URL: https://issues.apache.org/jira/browse/MESOS-8037
 Project: Mesos
  Issue Type: Improvement
Reporter: Andrei Budnik


`ns::clone` does a double fork in order to be able to enter a given PID namespace 
and returns the grandchild's pid, which is not a direct child of the parent process, 
hence the parent process cannot retrieve the exit status of the grandchild process.
As the second fork is implemented via `os::clone`, we can pass the `CLONE_PARENT` 
flag (see the sketch below). Also, we have to reap both the intermediate child 
process and the grandchild process to avoid zombies.

The motivation behind this improvement is that both `docker exec` and `LXC attach` 
can enter a process' PID namespace while still controlling the child's status code.
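
A minimal standalone sketch of the proposed mechanism (plain fork(2)/clone(2)/waitpid(2); 
this is not the actual `ns::clone`/`os::clone` code, and the setns() step into the 
target PID namespace is elided):
{code}
#include <sched.h>     // clone, CLONE_PARENT
#include <signal.h>    // SIGCHLD
#include <sys/wait.h>  // waitpid
#include <unistd.h>    // fork, _exit

#include <iostream>
#include <vector>

// Work done by the grandchild, after the intermediate child would have
// entered the target PID namespace (the setns() call is omitted here).
static int grandchild(void*)
{
  return 42;
}

int main()
{
  pid_t intermediate = fork();

  if (intermediate == 0) {
    // Intermediate child: this is where the target PID namespace would be
    // entered before cloning the grandchild.
    std::vector<char> stack(1024 * 1024);

    // CLONE_PARENT makes the new process a child of *our* parent, i.e. a
    // sibling of this intermediate process, so the original parent can
    // waitpid() on it directly and observe its exit status.
    pid_t pid = ::clone(
        grandchild,
        stack.data() + stack.size(),  // The stack grows downwards.
        CLONE_PARENT | SIGCHLD,
        nullptr);

    _exit(pid == -1 ? 1 : 0);
  }

  // Reap the intermediate child so it doesn't linger as a zombie...
  waitpid(intermediate, nullptr, 0);

  // ...then reap the grandchild, which is a direct child of this process
  // thanks to CLONE_PARENT (in real code its pid would be reported back
  // over a pipe; here we simply wait for any remaining child).
  int status = 0;
  pid_t pid = waitpid(-1, &status, 0);

  if (pid > 0 && WIFEXITED(status)) {
    std::cout << "grandchild " << pid << " exited with status "
              << WEXITSTATUS(status) << std::endl;
  }

  return 0;
}
{code}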




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7500) Command checks via agent lead to flaky tests.

2017-09-28 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16184072#comment-16184072
 ] 

Andrei Budnik commented on MESOS-7500:
--

Command health checks are executed via the `LAUNCH_NESTED_CONTAINER_SESSION` call 
and launched inside a DEBUG container.
A DEBUG container is always launched together with a `mesos-io-switchboard` 
process. After spawning `mesos-io-switchboard`, the agent tries to connect to it 
via a unix domain socket. If the DEBUG container exits before `mesos-io-switchboard` 
exits, the agent sends SIGTERM to the switchboard process after a 5 second delay. If 
`mesos-io-switchboard` exits after being killed by the signal, then the 
`LAUNCH_NESTED_CONTAINER_SESSION` call is considered failed, as is the 
corresponding health check.
It turned out that `mesos-io-switchboard` is not a real executable, but a special 
wrapper script generated by libtool. The first time this script is executed, 
relinking of the executable is triggered. Relinking takes quite a while on slow 
machines (e.g. in Apache CI): I've seen 8 seconds and more. So when the DEBUG 
container exits, the agent sends SIGTERM (as described above) to a process which is 
still being relinked. This happens each time a health check is launched, and as a 
result we see a bunch of failed tests in Apache CI.
To fix this issue we need to force libtool/autotools to generate a binary instead 
of a wrapper script, see:
1. https://autotools.io/libtool/wrappers.html
2. `info libtool`

> Command checks via agent lead to flaky tests.
> -
>
> Key: MESOS-7500
> URL: https://issues.apache.org/jira/browse/MESOS-7500
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: check, flaky-test, health-check, mesosphere
>
> Tests that rely on command checks via agent are flaky on Apache CI. Here is 
> an example from one of the failed run: https://pastebin.com/g2mPgYzu



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7500) Command checks via agent lead to flaky tests.

2017-09-28 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16184109#comment-16184109
 ] 

Andrei Budnik commented on MESOS-7500:
--

Example of related failing tests:
[ FAILED ] CommandExecutorCheckTest.CommandCheckDeliveredAndReconciled
[ FAILED ] CommandExecutorCheckTest.CommandCheckStatusChange
[ FAILED ] DefaultExecutorCheckTest.CommandCheckDeliveredAndReconciled
[ FAILED ] DefaultExecutorCheckTest.CommandCheckStatusChange
[ FAILED ] DefaultExecutorCheckTest.CommandCheckSeesParentsEnv
[ FAILED ] DefaultExecutorCheckTest.CommandCheckSharesWorkDirWithTask

> Command checks via agent lead to flaky tests.
> -
>
> Key: MESOS-7500
> URL: https://issues.apache.org/jira/browse/MESOS-7500
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: check, flaky-test, health-check, mesosphere
>
> Tests that rely on command checks via agent are flaky on Apache CI. Here is 
> an example from one of the failed run: https://pastebin.com/g2mPgYzu



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7586) Make use of cout/cerr and glog consistent.

2017-08-28 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7586:
-
Description: 
Some parts of mesos use glog before initialization of glog. This leads to 
message like:
“WARNING: Logging before InitGoogleLogging() is written to STDERR”
Also, messages via glog before logging is initialized might not end up in a 
logdir.
 
The solution might be:
cout/cerr should be used before logging initialization.
glog should be used after logging initialization.
 
Usually, main function has initialization pattern like:
# load = flags.load(argc, argv) // Load flags from command line.
# Check if flags are correct, otherwise print error message to cerr and then 
exit.
# Check if user passed --help flag to print help message to cout and then exit.
# Parsing and setup of environment variables. If this fails, EXIT macro is used 
to print error message via glog.
# process::initialize()
# logging::initialize()
 
Steps 2 and 3 should use cout/cerr to eliminate any extra information generated 
by glog like current time, date and log level.

It would be preferable to move step 6 between steps 3 and 4 safely, because 
{{logging::initialize()}} doesn’t depend on process::initialize().
Some parts of mesos don’t call logging::initialize(). This should also be fixed.
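
For illustration, a standalone sketch of this ordering (it uses glog directly 
rather than Mesos' flags and logging::initialize() wrappers, so the flag handling 
below is deliberately simplified):
{code}
#include <cstdlib>
#include <iostream>
#include <string>

#include <glog/logging.h>

int main(int argc, char** argv)
{
  // Steps 1-3: flag parsing, error reporting and --help output happen
  // before logging is initialized, so plain cerr/cout is used here.
  if (argc < 2) {
    std::cerr << "Usage: " << argv[0] << " <work_dir>" << std::endl;
    return EXIT_FAILURE;
  }

  if (std::string(argv[1]) == "--help") {
    std::cout << "Demonstrates the cout/cerr vs. glog ordering" << std::endl;
    return EXIT_SUCCESS;
  }

  // Step 6 (ideally moved right after step 3): initialize logging...
  google::InitGoogleLogging(argv[0]);

  // ...so that any later failure (step 4 and beyond) can go through glog
  // without the "Logging before InitGoogleLogging()" warning and without
  // losing messages from the logdir.
  LOG(INFO) << "Logging initialized; work_dir = " << argv[1];

  return EXIT_SUCCESS;
}
{code}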

  was:
Some parts of mesos use glog before initialization of glog. This leads to 
message like:
“WARNING: Logging before InitGoogleLogging() is written to STDERR”
Also, messages via glog before logging is initialized might not end up in a 
logdir.
 
The solution might be:
cout/cerr should be used before logging initialization.
glog should be used after logging initialization.
 
Usually, main function has pattern like:
1. load = flags.load(argc, argv) // Load flags from command line.
2. Check if flags are correct, otherwise print error message to cerr and then 
exit.
3. Check if user passed --help flag to print help message to cout and then exit.
4. Parsing and setup of environment variables. If this fails, EXIT macro is 
used to print error message via glog.
5. process::initialize()
6. logging::initialize()
7. ...
 
Steps 2 and 3 should use cout/cerr to eliminate any extra information generated 
by glog like current time, date and log level.
It is possible to move step 6 between steps 3 and 4 safely, because 
logging::initialize() doesn’t depend on process::initialize().
Some parts of mesos don’t call logging::initialize(). This should also be fixed.


> Make use of cout/cerr and glog consistent.
> --
>
> Key: MESOS-7586
> URL: https://issues.apache.org/jira/browse/MESOS-7586
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Budnik
>Assignee: Armand Grillet
>Priority: Minor
>  Labels: debugging, log, newbie
>
> Some parts of mesos use glog before initialization of glog. This leads to 
> message like:
> “WARNING: Logging before InitGoogleLogging() is written to STDERR”
> Also, messages via glog before logging is initialized might not end up in a 
> logdir.
>  
> The solution might be:
> cout/cerr should be used before logging initialization.
> glog should be used after logging initialization.
>  
> Usually, main function has initialization pattern like:
> # load = flags.load(argc, argv) // Load flags from command line.
> # Check if flags are correct, otherwise print error message to cerr and then 
> exit.
> # Check if user passed --help flag to print help message to cout and then 
> exit.
> # Parsing and setup of environment variables. If this fails, EXIT macro is 
> used to print error message via glog.
> # process::initialize()
> # logging::initialize()
>  
> Steps 2 and 3 should use cout/cerr to eliminate any extra information 
> generated by glog like current time, date and log level.
> It would be preferable to move step 6 between steps 3 and 4 safely, because 
> {{logging::initialize()}} doesn’t depend on process::initialize().
> Some parts of mesos don’t call logging::initialize(). This should also be 
> fixed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7586) Make use of cout/cerr and glog consistent.

2017-08-28 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7586:
-
Description: 
Some parts of mesos use glog before initialization of glog. This leads to 
message like:
“WARNING: Logging before InitGoogleLogging() is written to STDERR”
Also, messages via glog before logging is initialized might not end up in a 
logdir.
 
The solution might be:
{{cout/cerr}} should be used before logging initialization.
{{glog}} should be used after logging initialization.
 
Usually, main function has initialization pattern like:
# load = flags.load(argc, argv) // Load flags from command line.
# Check if flags are correct, otherwise print error message to cerr and then 
exit.
# Check if user passed --help flag to print help message to cout and then exit.
# Parsing and setup of environment variables. If this fails, EXIT macro is used 
to print error message via glog.
# process::initialize()
# logging::initialize()
 
Steps 2 and 3 should use {{cout/cerr}} to eliminate any extra information 
generated by glog like current time, date and log level.

It would be preferable to move step 6 between steps 3 and 4 safely, because 
{{logging::initialize()}} doesn’t depend on {{process::initialize()}}.
In addition, initialization of glog should be added, where it necessary.

  was:
Some parts of mesos use glog before initialization of glog. This leads to 
message like:
“WARNING: Logging before InitGoogleLogging() is written to STDERR”
Also, messages via glog before logging is initialized might not end up in a 
logdir.
 
The solution might be:
cout/cerr should be used before logging initialization.
glog should be used after logging initialization.
 
Usually, main function has initialization pattern like:
# load = flags.load(argc, argv) // Load flags from command line.
# Check if flags are correct, otherwise print error message to cerr and then 
exit.
# Check if user passed --help flag to print help message to cout and then exit.
# Parsing and setup of environment variables. If this fails, EXIT macro is used 
to print error message via glog.
# process::initialize()
# logging::initialize()
 
Steps 2 and 3 should use cout/cerr to eliminate any extra information generated 
by glog like current time, date and log level.

It would be preferable to move step 6 between steps 3 and 4 safely, because 
{{logging::initialize()}} doesn’t depend on process::initialize().
Some parts of mesos don’t call logging::initialize(). This should also be fixed.


> Make use of cout/cerr and glog consistent.
> --
>
> Key: MESOS-7586
> URL: https://issues.apache.org/jira/browse/MESOS-7586
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Budnik
>Assignee: Armand Grillet
>Priority: Minor
>  Labels: debugging, log, newbie
>
> Some parts of mesos use glog before initialization of glog. This leads to 
> message like:
> “WARNING: Logging before InitGoogleLogging() is written to STDERR”
> Also, messages via glog before logging is initialized might not end up in a 
> logdir.
>  
> The solution might be:
> {{cout/cerr}} should be used before logging initialization.
> {{glog}} should be used after logging initialization.
>  
> Usually, main function has initialization pattern like:
> # load = flags.load(argc, argv) // Load flags from command line.
> # Check if flags are correct, otherwise print error message to cerr and then 
> exit.
> # Check if user passed --help flag to print help message to cout and then 
> exit.
> # Parsing and setup of environment variables. If this fails, EXIT macro is 
> used to print error message via glog.
> # process::initialize()
> # logging::initialize()
>  
> Steps 2 and 3 should use {{cout/cerr}} to eliminate any extra information 
> generated by glog like current time, date and log level.
> It would be preferable to move step 6 between steps 3 and 4 safely, because 
> {{logging::initialize()}} doesn’t depend on {{process::initialize()}}.
> In addition, initialization of glog should be added, where it necessary.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7586) Make use of cout/cerr and glog consistent.

2017-08-28 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7586:
-
Description: 
Some parts of mesos use glog before initialization of glog. This leads to 
message like:
bq. WARNING: Logging before InitGoogleLogging() is written to STDERR
Also, messages via glog before logging is initialized might not end up in a 
logdir.
 
The solution might be:
{{cout/cerr}} should be used before logging initialization.
{{glog}} should be used after logging initialization.
 
Usually, main function has initialization pattern like:
# load = flags.load(argc, argv) // Load flags from command line.
# Check if flags are correct, otherwise print error message to cerr and then 
exit.
# Check if user passed --help flag to print help message to cout and then exit.
# Parsing and setup of environment variables. If this fails, EXIT macro is used 
to print error message via glog.
# process::initialize()
# logging::initialize()
 
Steps 2 and 3 should use {{cout/cerr}} to eliminate any extra information 
generated by glog like current time, date and log level.

It would be preferable to move step 6 between steps 3 and 4 safely, because 
{{logging::initialize()}} doesn’t depend on {{process::initialize()}}.
In addition, initialization of glog should be added, where it necessary.

  was:
Some parts of mesos use glog before initialization of glog. This leads to 
message like:
“WARNING: Logging before InitGoogleLogging() is written to STDERR”
Also, messages via glog before logging is initialized might not end up in a 
logdir.
 
The solution might be:
{{cout/cerr}} should be used before logging initialization.
{{glog}} should be used after logging initialization.
 
Usually, main function has initialization pattern like:
# load = flags.load(argc, argv) // Load flags from command line.
# Check if flags are correct, otherwise print error message to cerr and then 
exit.
# Check if user passed --help flag to print help message to cout and then exit.
# Parsing and setup of environment variables. If this fails, EXIT macro is used 
to print error message via glog.
# process::initialize()
# logging::initialize()
 
Steps 2 and 3 should use {{cout/cerr}} to eliminate any extra information 
generated by glog like current time, date and log level.

It would be preferable to move step 6 between steps 3 and 4 safely, because 
{{logging::initialize()}} doesn’t depend on {{process::initialize()}}.
In addition, initialization of glog should be added, where it necessary.


> Make use of cout/cerr and glog consistent.
> --
>
> Key: MESOS-7586
> URL: https://issues.apache.org/jira/browse/MESOS-7586
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Budnik
>Assignee: Armand Grillet
>Priority: Minor
>  Labels: debugging, log, newbie
>
> Some parts of mesos use glog before initialization of glog. This leads to 
> message like:
> bq. WARNING: Logging before InitGoogleLogging() is written to STDERR
> Also, messages via glog before logging is initialized might not end up in a 
> logdir.
>  
> The solution might be:
> {{cout/cerr}} should be used before logging initialization.
> {{glog}} should be used after logging initialization.
>  
> Usually, main function has initialization pattern like:
> # load = flags.load(argc, argv) // Load flags from command line.
> # Check if flags are correct, otherwise print error message to cerr and then 
> exit.
> # Check if user passed --help flag to print help message to cout and then 
> exit.
> # Parsing and setup of environment variables. If this fails, EXIT macro is 
> used to print error message via glog.
> # process::initialize()
> # logging::initialize()
>  
> Steps 2 and 3 should use {{cout/cerr}} to eliminate any extra information 
> generated by glog like current time, date and log level.
> It would be preferable to move step 6 between steps 3 and 4 safely, because 
> {{logging::initialize()}} doesn’t depend on {{process::initialize()}}.
> In addition, initialization of glog should be added, where it necessary.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7586) Make use of cout/cerr and glog consistent.

2017-08-28 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7586:
-
Description: 
Some parts of mesos use glog before initialization of glog, hence messages via 
glog might not end up in a logdir:
bq. WARNING: Logging before InitGoogleLogging() is written to STDERR

The solution might be:
{{cout/cerr}} should be used before logging initialization.
{{glog}} should be used after logging initialization.
 
Usually, main function has initialization pattern like:
# load = flags.load(argc, argv) // Load flags from command line.
# Check if flags are correct, otherwise print error message to cerr and then 
exit.
# Check if user passed --help flag to print help message to cout and then exit.
# Parsing and setup of environment variables. If this fails, EXIT macro is used 
to print error message via glog.
# process::initialize()
# logging::initialize()
 
Steps 2 and 3 should use {{cout/cerr}} to eliminate any extra information 
generated by glog like current time, date and log level.

It would be preferable to move step 6 between steps 3 and 4 safely, because 
{{logging::initialize()}} doesn’t depend on {{process::initialize()}}.
In addition, initialization of glog should be added, where it's necessary.

  was:
Some parts of mesos use glog before initialization of glog. This leads to 
message like:
bq. WARNING: Logging before InitGoogleLogging() is written to STDERR
Also, messages via glog before logging is initialized might not end up in a 
logdir.
 
The solution might be:
{{cout/cerr}} should be used before logging initialization.
{{glog}} should be used after logging initialization.
 
Usually, main function has initialization pattern like:
# load = flags.load(argc, argv) // Load flags from command line.
# Check if flags are correct, otherwise print error message to cerr and then 
exit.
# Check if user passed --help flag to print help message to cout and then exit.
# Parsing and setup of environment variables. If this fails, EXIT macro is used 
to print error message via glog.
# process::initialize()
# logging::initialize()
 
Steps 2 and 3 should use {{cout/cerr}} to eliminate any extra information 
generated by glog like current time, date and log level.

It would be preferable to move step 6 between steps 3 and 4 safely, because 
{{logging::initialize()}} doesn’t depend on {{process::initialize()}}.
In addition, initialization of glog should be added, where it's necessary.


> Make use of cout/cerr and glog consistent.
> --
>
> Key: MESOS-7586
> URL: https://issues.apache.org/jira/browse/MESOS-7586
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Budnik
>Assignee: Armand Grillet
>Priority: Minor
>  Labels: debugging, log, newbie
>
> Some parts of mesos use glog before initialization of glog, hence messages 
> via glog might not end up in a logdir:
> bq. WARNING: Logging before InitGoogleLogging() is written to STDERR
> The solution might be:
> {{cout/cerr}} should be used before logging initialization.
> {{glog}} should be used after logging initialization.
>  
> Usually, main function has initialization pattern like:
> # load = flags.load(argc, argv) // Load flags from command line.
> # Check if flags are correct, otherwise print error message to cerr and then 
> exit.
> # Check if user passed --help flag to print help message to cout and then 
> exit.
> # Parsing and setup of environment variables. If this fails, EXIT macro is 
> used to print error message via glog.
> # process::initialize()
> # logging::initialize()
>  
> Steps 2 and 3 should use {{cout/cerr}} to eliminate any extra information 
> generated by glog like current time, date and log level.
> It would be preferable to move step 6 between steps 3 and 4 safely, because 
> {{logging::initialize()}} doesn’t depend on {{process::initialize()}}.
> In addition, initialization of glog should be added, where it's necessary.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7586) Make use of cout/cerr and glog consistent.

2017-08-28 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7586:
-
Description: 
Some parts of mesos use glog before initialization of glog. This leads to 
message like:
bq. WARNING: Logging before InitGoogleLogging() is written to STDERR
Also, messages via glog before logging is initialized might not end up in a 
logdir.
 
The solution might be:
{{cout/cerr}} should be used before logging initialization.
{{glog}} should be used after logging initialization.
 
Usually, main function has initialization pattern like:
# load = flags.load(argc, argv) // Load flags from command line.
# Check if flags are correct, otherwise print error message to cerr and then 
exit.
# Check if user passed --help flag to print help message to cout and then exit.
# Parsing and setup of environment variables. If this fails, EXIT macro is used 
to print error message via glog.
# process::initialize()
# logging::initialize()
 
Steps 2 and 3 should use {{cout/cerr}} to eliminate any extra information 
generated by glog like current time, date and log level.

It would be preferable to move step 6 between steps 3 and 4 safely, because 
{{logging::initialize()}} doesn’t depend on {{process::initialize()}}.
In addition, initialization of glog should be added, where it's necessary.

  was:
Some parts of mesos use glog before initialization of glog. This leads to 
message like:
bq. WARNING: Logging before InitGoogleLogging() is written to STDERR
Also, messages via glog before logging is initialized might not end up in a 
logdir.
 
The solution might be:
{{cout/cerr}} should be used before logging initialization.
{{glog}} should be used after logging initialization.
 
Usually, main function has initialization pattern like:
# load = flags.load(argc, argv) // Load flags from command line.
# Check if flags are correct, otherwise print error message to cerr and then 
exit.
# Check if user passed --help flag to print help message to cout and then exit.
# Parsing and setup of environment variables. If this fails, EXIT macro is used 
to print error message via glog.
# process::initialize()
# logging::initialize()
 
Steps 2 and 3 should use {{cout/cerr}} to eliminate any extra information 
generated by glog like current time, date and log level.

It would be preferable to move step 6 between steps 3 and 4 safely, because 
{{logging::initialize()}} doesn’t depend on {{process::initialize()}}.
In addition, initialization of glog should be added, where it necessary.


> Make use of cout/cerr and glog consistent.
> --
>
> Key: MESOS-7586
> URL: https://issues.apache.org/jira/browse/MESOS-7586
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Budnik
>Assignee: Armand Grillet
>Priority: Minor
>  Labels: debugging, log, newbie
>
> Some parts of mesos use glog before initialization of glog. This leads to 
> message like:
> bq. WARNING: Logging before InitGoogleLogging() is written to STDERR
> Also, messages via glog before logging is initialized might not end up in a 
> logdir.
>  
> The solution might be:
> {{cout/cerr}} should be used before logging initialization.
> {{glog}} should be used after logging initialization.
>  
> Usually, main function has initialization pattern like:
> # load = flags.load(argc, argv) // Load flags from command line.
> # Check if flags are correct, otherwise print error message to cerr and then 
> exit.
> # Check if user passed --help flag to print help message to cout and then 
> exit.
> # Parsing and setup of environment variables. If this fails, EXIT macro is 
> used to print error message via glog.
> # process::initialize()
> # logging::initialize()
>  
> Steps 2 and 3 should use {{cout/cerr}} to eliminate any extra information 
> generated by glog like current time, date and log level.
> It would be preferable to move step 6 between steps 3 and 4 safely, because 
> {{logging::initialize()}} doesn’t depend on {{process::initialize()}}.
> In addition, initialization of glog should be added, where it's necessary.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-6428) Mesos containerizer helper function signalSafeWriteStatus is not AS-Safe

2017-08-28 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143878#comment-16143878
 ] 

Andrei Budnik commented on MESOS-6428:
--

https://reviews.apache.org/r/61800/

> Mesos containerizer helper function signalSafeWriteStatus is not AS-Safe
> 
>
> Key: MESOS-6428
> URL: https://issues.apache.org/jira/browse/MESOS-6428
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.1.0
>Reporter: Benjamin Bannier
>Assignee: Jing Chen
>  Labels: newbie, tech-debt
>
> In {{src/slave/containerizer/mesos/launch.cpp}} a helper function 
> {{signalSafeWriteStatus}} is defined. Its name seems to suggest that this 
> function is safe to call in e.g., signal handlers, and it is used in this 
> file's {{signalHandler}} for exactly that purpose.
> Currently this function is not AS-Safe since it e.g., allocates memory via 
> construction of {{string}} instances, and might destructively modify 
> {{errno}}.
> We should clean up this function to be in fact AS-Safe.
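
For reference, a minimal sketch of what an AS-Safe variant of such a helper could 
look like (this is not the actual launch.cpp code; it only illustrates the 
constraints: no heap allocation, manual integer formatting, write(2) only, and 
errno saved/restored):
{code}
#include <unistd.h>  // write

#include <cerrno>
#include <cstddef>   // size_t

// Write a non-negative status value to `fd` using only async-signal-safe
// operations: a stack buffer, manual integer formatting and write(2).
void signalSafeWriteStatus(int fd, int status)
{
  const int savedErrno = errno;  // write() may clobber errno.

  char buf[16];
  size_t i = sizeof(buf);
  unsigned int value = static_cast<unsigned int>(status);

  do {
    buf[--i] = static_cast<char>('0' + value % 10);
    value /= 10;
  } while (value != 0 && i > 0);

  // write(2) is async-signal-safe per POSIX; snprintf and std::string
  // construction are not guaranteed to be.
  ssize_t result = ::write(fd, buf + i, sizeof(buf) - i);
  (void) result;  // Nothing sensible to do on error inside a signal handler.

  errno = savedErrno;
}
{code}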



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7892) Filter results of `/state` on agent by role.

2017-09-05 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7892:
-
Fix Version/s: (was: 1.4.0)
   1.5.0

> Filter results of `/state` on agent by role.
> 
>
> Key: MESOS-7892
> URL: https://issues.apache.org/jira/browse/MESOS-7892
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>  Labels: mesosphere, security
> Fix For: 1.5.0
>
>
> The results returned by {{/state}} include data about resource reservations 
> per each role, which should be filtered for certain users, particularly in a 
> multi-tenancy scenario.
> The kind of leaked data includes specific role names and their specific 
> reservations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7504) Parent's mount namespace cannot be determined when launching a nested container.

2017-10-04 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16191456#comment-16191456
 ] 

Andrei Budnik commented on MESOS-7504:
--

{{(launch).failure(): Cannot get target mount namespace from process 10991: 
Cannot get 'mnt' namespace for 2nd-level child process '11001': Failed to stat 
mnt namespace handle for pid 11001: No such file or directory}}

> Parent's mount namespace cannot be determined when launching a nested 
> container.
> 
>
> Key: MESOS-7504
> URL: https://issues.apache.org/jira/browse/MESOS-7504
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.3.0
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
>
> I've observed this failure twice in different Linux environments. Here is an 
> example of such failure:
> {noformat}
> [ RUN  ] 
> NestedMesosContainerizerTest.ROOT_CGROUPS_DestroyDebugContainerOnRecover
> I0509 21:53:25.471657 17167 containerizer.cpp:221] Using isolation: 
> cgroups/cpu,filesystem/linux,namespaces/pid,network/cni,volume/image
> I0509 21:53:25.475124 17167 linux_launcher.cpp:150] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> I0509 21:53:25.475407 17167 provisioner.cpp:249] Using default backend 
> 'overlay'
> I0509 21:53:25.481232 17186 containerizer.cpp:608] Recovering containerizer
> I0509 21:53:25.482295 17186 provisioner.cpp:410] Provisioner recovery complete
> I0509 21:53:25.482587 17187 containerizer.cpp:1001] Starting container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d for executor 'executor' of framework 
> I0509 21:53:25.482918 17189 cgroups.cpp:410] Creating cgroup at 
> '/sys/fs/cgroup/cpu,cpuacct/mesos_test_d989f526-efe0-4553-bf79-936ad66c3753/21bc372c-0f2c-49f5-b8ab-8d32c232b95d'
>  for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484103 17190 cpu.cpp:101] Updated 'cpu.shares' to 1024 (cpus 
> 1) for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484808 17186 containerizer.cpp:1524] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> --launch_info="{"clone_namespaces":[131072,536870912],"command":{"shell":true,"value":"sleep
>  
> 1000"},"environment":{"variables":[{"name":"MESOS_SANDBOX","type":"VALUE","value":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}]},"pre_exec_commands":[{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/home\/ubuntu\/workspace\/mesos\/Mesos_CI-build\/FLAG\/SSL\/label\/mesos-ec2-ubuntu-16.04\/mesos\/build\/src\/mesos-containerizer"},{"shell":true,"value":"mount
>  -n -t proc proc \/proc -o 
> nosuid,noexec,nodev"}],"working_directory":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}"
>  --pipe_read="29" --pipe_write="32" 
> --runtime_directory="/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_sKhtj7/containers/21bc372c-0f2c-49f5-b8ab-8d32c232b95d"
>  --unshare_namespace_mnt="false"'
> I0509 21:53:25.484978 17189 linux_launcher.cpp:429] Launching container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWPID
> I0509 21:53:25.513890 17186 containerizer.cpp:1623] Checkpointing container's 
> forked pid 1873 to 
> '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_Rdjw6M/meta/slaves/frameworks/executors/executor/runs/21bc372c-0f2c-49f5-b8ab-8d32c232b95d/pids/forked.pid'
> I0509 21:53:25.515878 17190 fetcher.cpp:353] Starting to fetch URIs for 
> container: 21bc372c-0f2c-49f5-b8ab-8d32c232b95d, directory: 
> /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr
> I0509 21:53:25.517715 17193 containerizer.cpp:1791] Starting nested container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.518569 17193 switchboard.cpp:545] Launching 
> 'mesos-io-switchboard' with flags '--heartbeat_interval="30secs" 
> --help="false" 
> --socket_address="/tmp/mesos-io-switchboard-ca463cf2-70ba-4121-a5c6-1a170ae40c1b"
>  --stderr_from_fd="36" --stderr_to_fd="2" --stdin_to_fd="32" 
> --stdout_from_fd="33" --stdout_to_fd="1" --tty="false" 
> --wait_for_connection="true"' for container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.521229 17193 switchboard.cpp:575] Created I/O switchboard 
> server (pid: 1881) listening on socket file 
> '/tmp/mesos-io-switchboard-ca463cf2-70ba-4121-a5c6-1a170ae40c1b' for 
> container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> 

[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2017-10-11 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16200341#comment-16200341
 ] 

Andrei Budnik commented on MESOS-7506:
--

I put a {{::sleep(2);}} after {{slave = this->StartSlave(detector.get(), 
containerizer.get(), flags);}} in 
[SlaveRecoveryTest.RecoverTerminatedExecutor|https://github.com/apache/mesos/blob/0908303142f641c1697547eb7f8e82a205d6c362/src/tests/slave_recovery_tests.cpp#L1634]
 and got:

{code}
../../src/tests/slave_recovery_tests.cpp:1656: Failure
  Expected: TASK_LOST
To be equal to: status->state()
  Which is: TASK_FAILED
{code}

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8005) Mesos.SlaveTest.ShutdownUnregisteredExecutor is flaky

2017-10-11 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16200383#comment-16200383
 ] 

Andrei Budnik commented on MESOS-8005:
--

{code}
[ RUN  ] SlaveTest.ShutdownUnregisteredExecutor
I0922 00:38:40.364121 31018 cluster.cpp:162] Creating default 'local' authorizer
I0922 00:38:40.365996 31034 master.cpp:445] Master 
83bd1613-70d9-4c3e-b490-4aa60dd26e22 (ip-172-16-10-25) started on 
172.16.10.25:44747
I0922 00:38:40.366019 31034 master.cpp:447] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/u6YBLG/credentials" 
--filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/u6YBLG/master" 
--zk_session_timeout="10secs"
I0922 00:38:40.366137 31034 master.cpp:497] Master only allowing authenticated 
frameworks to register
I0922 00:38:40.366145 31034 master.cpp:511] Master only allowing authenticated 
agents to register
I0922 00:38:40.366150 31034 master.cpp:524] Master only allowing authenticated 
HTTP frameworks to register
I0922 00:38:40.366155 31034 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/u6YBLG/credentials'
I0922 00:38:40.366237 31034 master.cpp:569] Using default 'crammd5' 
authenticator
I0922 00:38:40.366286 31034 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0922 00:38:40.366349 31034 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0922 00:38:40.366389 31034 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0922 00:38:40.366443 31034 master.cpp:649] Authorization enabled
I0922 00:38:40.366475 31039 hierarchical.cpp:171] Initialized hierarchical 
allocator process
I0922 00:38:40.366564 31038 whitelist_watcher.cpp:77] No whitelist given
I0922 00:38:40.367216 31036 master.cpp:2166] Elected as the leading master!
I0922 00:38:40.367238 31036 master.cpp:1705] Recovering from registrar
I0922 00:38:40.367282 31036 registrar.cpp:347] Recovering registrar
I0922 00:38:40.367449 31036 registrar.cpp:391] Successfully fetched the 
registry (0B) in 150016ns
I0922 00:38:40.367483 31036 registrar.cpp:495] Applied 1 operations in 5392ns; 
attempting to update the registry
I0922 00:38:40.367624 31034 registrar.cpp:552] Successfully updated the 
registry in 119808ns
I0922 00:38:40.367697 31034 registrar.cpp:424] Successfully recovered registrar
I0922 00:38:40.367858 31036 hierarchical.cpp:209] Skipping recovery of 
hierarchical allocator: nothing to recover
I0922 00:38:40.367869 31037 master.cpp:1804] Recovered 0 agents from the 
registry (142B); allowing 10mins for agents to re-register
I0922 00:38:40.368898 31018 containerizer.cpp:292] Using isolation { 
environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
I0922 00:38:40.372519 31018 linux_launcher.cpp:146] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I0922 00:38:40.372859 31018 provisioner.cpp:255] Using default backend 'overlay'
W0922 00:38:40.375388 31018 process.cpp:3194] Attempted to spawn already 
running process files@172.16.10.25:44747
I0922 00:38:40.375486 31018 cluster.cpp:448] Creating default 'local' authorizer
I0922 00:38:40.375942 31036 slave.cpp:254] Mesos agent started on 
(531)@172.16.10.25:44747
W0922 00:38:40.376080 31018 process.cpp:3194] Attempted to spawn already 
running process version@172.16.10.25:44747
I0922 00:38:40.375958 31036 slave.cpp:255] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://; 
--appc_store_dir="/tmp/SlaveTest_ShutdownUnregisteredExecutor_mhaf10/store/appc"
 --authenticate_http_executors="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticatee="crammd5" 
--authentication_backoff_factor="1secs" --authorizer="local" 

[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2017-10-18 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209541#comment-16209541
 ] 

Andrei Budnik commented on MESOS-7506:
--

All failing tests have the same error message in the logs:
{{E0922 00:38:40.509032 31034 slave.cpp:5398] Termination of executor '1' of 
framework 83bd1613-70d9-4c3e-b490-4aa60dd26e22- failed: Failed to kill all 
processes in the container: Timed out after 1mins}}

The container termination future is triggered by 
[MesosContainerizerProcess::___destroy|https://github.com/apache/mesos/blob/b361801f2c78043459199dab3e0defe9a0b4c1aa/src/slave/containerizer/mesos/containerizer.cpp#L2361].
 The agent subscribes to this future by calling 
[containerizer->wait()|https://github.com/apache/mesos/blob/b361801f2c78043459199dab3e0defe9a0b4c1aa/src/slave/slave.cpp#L5280].
 Triggering this future leads to a call to {{Slave::executorTerminated}}, 
which sends a {{TASK_FAILED}} status update.

A typical test (e.g. {{SlaveTest.ShutdownUnregisteredExecutor}}) waits for
{code}
  // Ensure that the slave times out and kills the executor.
  Future<Nothing> destroyExecutor =
    FUTURE_DISPATCH(_, &MesosContainerizerProcess::destroy);
{code}

After that, the test waits for the {{TASK_FAILED}} status update. So the test 
completes successfully and the slave's destructor is called, [which 
fails|https://github.com/apache/mesos/blob/b361801f2c78043459199dab3e0defe9a0b4c1aa/src/tests/cluster.cpp#L580],
 because {{MesosContainerizerProcess::___destroy}} doesn't erase the container 
from the hashmap.

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2017-10-20 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212605#comment-16212605
 ] 

Andrei Budnik commented on MESOS-7506:
--

The bug has been reproduced with extra debug logging added 
(SlaveTest.ShutdownUnregisteredExecutor):
{code}
I1020 12:07:20.266032  9274 containerizer.cpp:2220] Destroying container 
7f9cb5a6-26c9-4010-ace9-b9cb3e065542 in RUNNING state
I1020 12:07:20.266042  9274 containerizer.cpp:2784] Transitioning the state of 
container 7f9cb5a6-26c9-4010-ace9-b9cb3e065542 from RUNNING to DESTROYING
I1020 12:07:20.266175  9274 linux_launcher.cpp:514] Asked to destroy container 
7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.266717  9274 linux_launcher.cpp:560] Using freezer to destroy 
cgroup mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.268649  9274 cgroups.cpp:1562] TasksKiller::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.268756  9274 cgroups.cpp:3083] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.269533  9276 cgroups.cpp:1397] Freezer::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.270486  9276 cgroups.cpp:1422] Freezer::freeze 2: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542: FREEZING
I1020 12:07:20.270725  9272 cgroups.cpp:1397] Freezer::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.271625  9272 cgroups.cpp:1415] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542 after 1secs
I1020 12:07:20.271724  9272 hierarchical.cpp:1488] Performed allocation for 1 
agents in 18541ns
I1020 12:07:20.271767  9272 cgroups.cpp:1573] TasksKiller::kill: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.273386  9272 cgroups.cpp:1596] TasksKiller::thaw: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.273486  9272 cgroups.cpp:3101] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.274129  9272 cgroups.cpp:1431] Freezer::thaw: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.276964  9272 cgroups.cpp:1448] Successfully thawed cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542 after 0ns
I1020 12:07:20.277225  9277 cgroups.cpp:1602] TasksKiller::reap: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.277613  9279 hierarchical.cpp:1488] Performed allocation for 1 
agents in 17680ns
I1020 12:07:20.22  9279 containerizer.cpp:2671] Container 
7f9cb5a6-26c9-4010-ace9-b9cb3e065542 has exited
{code}
{{TasksKiller::finished}} wasn't called, while {{TasksKiller::reap}} was called. 
So I assume there is a race condition in {{TasksKiller::kill}}: probably 
{{cgroups::processes()}} called in {{TasksKiller::kill}} returns a list L1 which 
differs from the list L2 returned by the same function in {{cgroups::kill}}.

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2017-11-17 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256897#comment-16256897
 ] 

Andrei Budnik commented on MESOS-7506:
--

https://reviews.apache.org/r/63887/
https://reviews.apache.org/r/63888/

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ROOT_IsolatorFlags-badrun.txt, ResourceLimitation-badrun.txt, 
> ResourceLimitation-badrun2.txt, 
> RestartSlaveRequireExecutorAuthentication-badrun.txt, 
> TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.ResourceLimitation/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillMultipleTasks/0
> SlaveTest.RestartSlaveRequireExecutorAuthentication
> LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8247) Executor registered message is lost

2017-11-17 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8247:


 Summary: Executor registered message is lost
 Key: MESOS-8247
 URL: https://issues.apache.org/jira/browse/MESOS-8247
 Project: Mesos
  Issue Type: Bug
Reporter: Andrei Budnik


h3. Brief description of successful agent-executor communication.
The executor sends a `RegisterExecutorMessage` to the agent during its 
initialization step. The agent sends an `ExecutorRegisteredMessage` back to the 
executor in its `registerExecutor()` method. Whenever the executor receives the 
`ExecutorRegisteredMessage`, it prints `Executor registered on agent...` to its 
stderr log.

h3. Problem description.
The agent launches the built-in docker executor, which is stuck in the `STAGING` state.
The stderr log of the docker executor:
{code}
I1114 23:03:17.919090 14322 exec.cpp:162] Version: 1.2.3
{code}
It doesn't contain a message like `Executor registered on agent...`. At the same 
time the agent received `RegisterExecutorMessage` and sent a `runTask` message 
to the executor.

The stdout log consists of the same repeating message:
{code}
Received killTask for task ...
{code}
Also, the docker executor process doesn't have any child processes.

Currently, the executor [doesn't 
attempt|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L320]
 to launch a task if it is not registered with the agent, while [task 
killing|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L343]
 doesn't have such a check.

It looks like the `ExecutorRegisteredMessage` has been lost.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8247) Executor registered message is lost

2017-11-17 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257291#comment-16257291
 ] 

Andrei Budnik commented on MESOS-8247:
--

Related https://issues.apache.org/jira/browse/MESOS-3851 ?

> Executor registered message is lost
> ---
>
> Key: MESOS-8247
> URL: https://issues.apache.org/jira/browse/MESOS-8247
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Budnik
>
> h3. Brief description of successful agent-executor communication.
> Executor sends `RegisterExecutorMessage` message to Agent during 
> initialization step. Agent sends a `ExecutorRegisteredMessage` message as a 
> response to the Executor in `registerExecutor()` method. Whenever executor 
> receives `ExecutorRegisteredMessage`, it prints a `Executor registered on 
> agent...` to stderr logs.
> h3. Problem description.
> The agent launches built-in docker executor, which is stuck in `STAGING` 
> state.
> stderr logs of the docker executor:
> {code}
> I1114 23:03:17.919090 14322 exec.cpp:162] Version: 1.2.3
> {code}
> It doesn't contain a message like `Executor registered on agent...`. At the 
> same time agent received `RegisterExecutorMessage` and sent `runTask` message 
> to the executor.
> stdout logs consists of the same repeating message:
> {code}
> Received killTask for task ...
> {code}
> Also, the docker executor process doesn't contain child processes.
> Currently, executor [doesn't 
> attempt|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L320]
>  to launch a task if it is not registered at the agent, while [task 
> killing|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L343]
>  doesn't have such a check.
> It looks like `ExecutorRegisteredMessage` has been lost.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8137) Mesos agent can hang during startup.

2017-11-14 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251680#comment-16251680
 ] 

Andrei Budnik commented on MESOS-8137:
--

Do we have a stack trace of all the threads?
What is the version of glibc?

> Mesos agent can hang during startup.
> 
>
> Key: MESOS-8137
> URL: https://issues.apache.org/jira/browse/MESOS-8137
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.0
>Reporter: Jie Yu
>Priority: Critical
>
> Environment:
> Linux dcos-agentdisks-as1-1100-2 4.11.0-1011-azure #11-Ubuntu SMP Tue Sep 19 
> 19:03:54 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
> {noformat}
> #0  __lll_lock_wait_private () at 
> ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
> #1  0x7f132b856f7b in __malloc_fork_lock_parent () at arena.c:155
> #2  0x7f132b89f5da in __libc_fork () at ../sysdeps/nptl/fork.c:131
> #3  0x7f132b842350 in _IO_new_proc_open (fp=fp@entry=0xf1282b84e0, 
> command=command@entry=0xf1282b6ea8 “logrotate --help > /dev/null”, 
> mode=, mode@entry=0xf1275fb0f2 “r”)
> at iopopen.c:180
> #4  0x7f132b84265c in _IO_new_popen (command=0xf1282b6ea8 “logrotate 
> --help > /dev/null”, mode=0xf1275fb0f2 “r”) at iopopen.c:296
> #5  0x00f1275e622a in Try os::shell<>(std::string 
> const&) ()
> #6  0x7f130fdbae37 in 
> mesos::journald::flags::Flags()::{lambda(std::string 
> const&)#2}::operator()(std::string const&) const (value=..., 
> __closure=)
> at /pkg/src/mesos-modules/journald/lib_journald.hpp:153
> #7  void flags::FlagsBase::add [10], mesos::journald::flags::basic_string()::{lambda(std::string 
> const&)#2}>(std::string mesos::journald::flags::*, flags::Name const&, 
> Option const&, std::string const&, 
> char const (*) [10], 
> mesos::journald::flags::basic_string()::{lambda(std::string 
> const&)#2})::{lambda(flags::FlagsBase const&)#3}::operator()(flags::FlagsBase 
> const) const (base=..., __closure=) at 
> /opt/mesosphere/active/mesos/include/stout/flags/flags.hpp:399
> #8  std::_Function_handler

[jira] [Commented] (MESOS-8247) Executor registered message is lost

2017-11-21 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261281#comment-16261281
 ] 

Andrei Budnik commented on MESOS-8247:
--

Additional logs:
{code}
Nov 14 23:03:21 ip-xxx mesos-agent[2029]: E1114 23:03:21.049590  2057 
process.cpp:2431] Failed to shutdown socket with fd 320: Transport
 endpoint is not connected
Nov 14 23:03:21 ip-xxx mesos-agent[2029]: I1114 23:03:21.049783  2054 
slave.cpp:4484] Got exited event for executor(1)@xx.xx.yy.zzz:10895
{code}

> Executor registered message is lost
> ---
>
> Key: MESOS-8247
> URL: https://issues.apache.org/jira/browse/MESOS-8247
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Budnik
>
> h3. Brief description of successful agent-executor communication.
> Executor sends `RegisterExecutorMessage` message to Agent during 
> initialization step. Agent sends a `ExecutorRegisteredMessage` message as a 
> response to the Executor in `registerExecutor()` method. Whenever executor 
> receives `ExecutorRegisteredMessage`, it prints a `Executor registered on 
> agent...` to stderr logs.
> h3. Problem description.
> The agent launches built-in docker executor, which is stuck in `STAGING` 
> state.
> stderr logs of the docker executor:
> {code}
> I1114 23:03:17.919090 14322 exec.cpp:162] Version: 1.2.3
> {code}
> It doesn't contain a message like `Executor registered on agent...`. At the 
> same time agent received `RegisterExecutorMessage` and sent `runTask` message 
> to the executor.
> stdout logs consists of the same repeating message:
> {code}
> Received killTask for task ...
> {code}
> Also, the docker executor process doesn't contain child processes.
> Currently, executor [doesn't 
> attempt|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L320]
>  to launch a task if it is not registered at the agent, while [task 
> killing|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L343]
>  doesn't have such a check.
> It looks like `ExecutorRegisteredMessage` has been lost.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7506) Multiple tests leave orphan containers.

2017-11-15 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7506:
-
Attachment: ROOT_IsolatorFlags-badrun.txt

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ROOT_IsolatorFlags-badrun.txt, ResourceLimitation-badrun.txt, 
> ResourceLimitation-badrun2.txt, 
> RestartSlaveRequireExecutorAuthentication-badrun.txt, 
> TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.ResourceLimitation/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillMultipleTasks/0
> SlaveTest.RestartSlaveRequireExecutorAuthentication
> LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8137) Mesos agent can hang during startup.

2017-11-15 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16253660#comment-16253660
 ] 

Andrei Budnik commented on MESOS-8137:
--

Probably related issues in glibc:
https://bugzilla.redhat.com/show_bug.cgi?id=1332917
https://bugzilla.redhat.com/show_bug.cgi?id=906468

> Mesos agent can hang during startup.
> 
>
> Key: MESOS-8137
> URL: https://issues.apache.org/jira/browse/MESOS-8137
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.0
>Reporter: Jie Yu
>Priority: Critical
>
> Environment:
> Linux dcos-agentdisks-as1-1100-2 4.11.0-1011-azure #11-Ubuntu SMP Tue Sep 19 
> 19:03:54 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
> {noformat}
> #0  __lll_lock_wait_private () at 
> ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
> #1  0x7f132b856f7b in __malloc_fork_lock_parent () at arena.c:155
> #2  0x7f132b89f5da in __libc_fork () at ../sysdeps/nptl/fork.c:131
> #3  0x7f132b842350 in _IO_new_proc_open (fp=fp@entry=0xf1282b84e0, 
> command=command@entry=0xf1282b6ea8 “logrotate --help > /dev/null”, 
> mode=, mode@entry=0xf1275fb0f2 “r”)
> at iopopen.c:180
> #4  0x7f132b84265c in _IO_new_popen (command=0xf1282b6ea8 “logrotate 
> --help > /dev/null”, mode=0xf1275fb0f2 “r”) at iopopen.c:296
> #5  0x00f1275e622a in Try os::shell<>(std::string 
> const&) ()
> #6  0x7f130fdbae37 in 
> mesos::journald::flags::Flags()::{lambda(std::string 
> const&)#2}::operator()(std::string const&) const (value=..., 
> __closure=)
> at /pkg/src/mesos-modules/journald/lib_journald.hpp:153
> #7  void flags::FlagsBase::add [10], mesos::journald::flags::basic_string()::{lambda(std::string 
> const&)#2}>(std::string mesos::journald::flags::*, flags::Name const&, 
> Option const&, std::string const&, 
> char const (*) [10], 
> mesos::journald::flags::basic_string()::{lambda(std::string 
> const&)#2})::{lambda(flags::FlagsBase const&)#3}::operator()(flags::FlagsBase 
> const) const (base=..., __closure=) at 
> /opt/mesosphere/active/mesos/include/stout/flags/flags.hpp:399
> #8  std::_Function_handler

[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2017-11-07 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16241916#comment-16241916
 ] 

Andrei Budnik commented on MESOS-7506:
--

https://reviews.apache.org/r/63589/

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: ResourceLimitation-badrun.txt, TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.ResourceLimitation/0
> SlaveRecoveryTest/0.RecoverUnregisteredExecutor
> SlaveRecoveryTest/0.CleanupExecutor
> SlaveRecoveryTest/0.RecoverTerminatedExecutor
> SlaveTest.ShutdownUnregisteredExecutor
> ShutdownUnregisteredExecutor
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-7506) Multiple tests leave orphan containers.

2017-11-08 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225515#comment-16225515
 ] 

Andrei Budnik edited comment on MESOS-7506 at 11/8/17 11:01 AM:


*First cause*

Some tests (from {{SlaveTest}} and {{SlaveRecoveryTest}}) have a pattern [like 
this|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/tests/slave_tests.cpp#L393-L406],
 where the clock is advanced by {{executor_registration_timeout}} and the test 
then waits in a loop until a task status update is sent. This loop executes 
while the container is being destroyed. Container destruction, in turn, consists 
of multiple steps, one of which waits for [cgroups 
destruction|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/slave/containerizer/mesos/linux_launcher.cpp#L567].
 That means we have a race between the container destruction process and the 
loop that advances the clock, with the following possible outcomes:
# The container is completely destroyed before the advancing clock reaches a 
timeout (e.g. {{cgroups::DESTROY_TIMEOUT}}).
# A timeout is triggered by the advancing clock before container destruction 
completes. That results in [leaving 
orphaned|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/slave/containerizer/mesos/containerizer.cpp#L2367-L2380]
 containers that are detected by the [Slave 
destructor|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/tests/cluster.cpp#L559-L584]
 in `tests/cluster.cpp`, so the test fails.

The issue is easily reproduced by advancing the clocks by 60 seconds or more in 
the loop that waits for a status update (see the sketch below).
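
A condensed sketch of the racy test pattern (illustrative, not the exact test 
code):
{code}
// The test advances the clock to fire the executor registration timeout,
// which starts destruction of the container ...
Clock::advance(flags.executor_registration_timeout);

// ... and then keeps advancing the clock while waiting for the status update.
// If these advances accumulate past cgroups::DESTROY_TIMEOUT before the
// freezer-based destroy finishes, the destroy gives up and the container is
// left behind as an orphan, failing the check in the Slave destructor.
while (status.isPending()) {
  Clock::advance(Seconds(1));
  Clock::settle();
}

AWAIT_READY(status);
{code}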


was (Author: abudnik):
Some tests (from {{SlaveTest}} and {{SlaveRecoveryTest}}) have a pattern [like 
this|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/tests/slave_tests.cpp#L393-L406],
 where the clock is advanced by {{executor_registration_timeout}} and then it 
waits in a loop until a task status update is sent. This loop is executing 
while the container is being destroyed. At the same time, container destruction 
consists of multiple steps, one of them waits for [cgroups 
destruction|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/slave/containerizer/mesos/linux_launcher.cpp#L567].
 That means, we have a race between container destruction process and the loop 
that advances the clock, leading to the following outcomes:
#  Container completely destroyed, before clock advancing reaches timeout (e.g. 
{{cgroups::DESTROY_TIMEOUT}}).
# Triggered timeout due to clock advancing, before container destruction 
completes. That results in [leaving 
orphaned|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/slave/containerizer/mesos/containerizer.cpp#L2367-L2380]
 containers that will be detected by [Slave 
destructor|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/tests/cluster.cpp#L559-L584]
 in `tests/cluster.cpp`, so the test will fail.

The issue is easily reproduced by advancing the clocks by 60 seconds or more in 
the loop, which waits for a status update.

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ResourceLimitation-badrun.txt, TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.ResourceLimitation/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillMultipleTasks/0
> SlaveRecoveryTest/0.RecoverUnregisteredExecutor
> SlaveRecoveryTest/0.CleanupExecutor
> SlaveRecoveryTest/0.RecoverTerminatedExecutor
> SlaveTest.ShutdownUnregisteredExecutor
> ShutdownUnregisteredExecutor
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2017-11-08 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243729#comment-16243729
 ] 

Andrei Budnik commented on MESOS-7506:
--

*Second cause*

{{[ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI/0|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/default_executor_tests.cpp#L1912]}}
 launches a task group, so each task is launched using the 
{{ComposingContainerizer}}.
When this test completes (after receiving a TASK_FINISHED status update), the 
Slave d-tor is called, where [it 
waits|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/cluster.cpp#L574]
 for each container's [termination 
future|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/slave/containerizer/mesos/containerizer.cpp#L2528]
 to be triggered.
As this test uses the {{ComposingContainerizer}}, [calling 
destroy|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/cluster.cpp#L572]
 for a container means the {{ComposingContainerizer}} subscribes to the same 
[container termination 
future|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/slave/containerizer/composing.cpp#L638-L647]
 via the {{onAny}} method. Once this future is triggered, a lambda function is 
dispatched that removes the {{containerId}} from the hash set.

When a container's termination future [is 
set|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/3rdparty/libprocess/include/process/future.hpp#L1524],
 the 
{{[AWAIT(wait)|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/cluster.cpp#L574]}}
 might already [be 
satisfied|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/3rdparty/libprocess/include/process/gtest.hpp#L83],
 so the container hash set will be [requested 
(dispatched)|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/cluster.cpp#L577].
 Hence there is a race between the thread that sets the container's termination 
future, calling {{onReadyCallbacks}} and then {{onAnyCallbacks}} (the latter 
dispatches the aforementioned lambda), and the test thread that waits for the 
container's termination future and then calls {{containerizer->containers()}}.

To reproduce this case, add a ~10ms sleep before 
[internal::run(copy->onAnyCallbacks, 
*this)|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/3rdparty/libprocess/include/process/future.hpp#L1537]
 and remove the sleep from [process::internal::await 
|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/3rdparty/libprocess/include/process/gtest.hpp#L92]
 (see the sketch below).
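
A condensed sketch of the two racing threads (illustrative; the real code lives 
in `composing.cpp`, `cluster.cpp` and `future.hpp`):
{code}
// Thread A: sets the container's termination future.
promise.set(termination);
//   -> runs onReadyCallbacks first (this is what satisfies AWAIT(wait) below),
//   -> only afterwards runs onAnyCallbacks, one of which dispatches the lambda
//      that removes containerId from the composing containerizer's hash set.

// Thread B: the test thread inside the Slave destructor (tests/cluster.cpp).
Future<Option<ContainerTermination>> wait =
  containerizer->wait(containerId);
containerizer->destroy(containerId);       // The composing c'zer subscribes to
                                           // the same termination future via
                                           // onAny to clean up its hash set.
AWAIT(wait);                               // Satisfied by onReadyCallbacks.
Future<hashset<ContainerID>> containers =
  containerizer->containers();             // May be dispatched before the
                                           // clean-up lambda runs, so it still
                                           // contains containerId and the test
                                           // reports an "orphan" container.
{code}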

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ResourceLimitation-badrun.txt, TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.ResourceLimitation/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillMultipleTasks/0
> SlaveRecoveryTest/0.RecoverUnregisteredExecutor
> SlaveRecoveryTest/0.CleanupExecutor
> SlaveRecoveryTest/0.RecoverTerminatedExecutor
> SlaveTest.ShutdownUnregisteredExecutor
> ShutdownUnregisteredExecutor
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7082) ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0 is flaky.

2017-11-08 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243738#comment-16243738
 ] 

Andrei Budnik commented on MESOS-7082:
--

https://issues.apache.org/jira/browse/MESOS-7506?focusedCommentId=16243729

> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0 is 
> flaky.
> -
>
> Key: MESOS-7082
> URL: https://issues.apache.org/jira/browse/MESOS-7082
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
> Environment: ubuntu 16.04 with/without SSL
> Fedora 23
>Reporter: Anand Mazumdar
>Priority: Critical
>  Labels: flaky, flaky-test, mesosphere
>
> Showed up on our internal CI
> {noformat}
> 07:00:17 [ RUN  ] 
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> 07:00:17 I0207 07:00:17.775459  2952 cluster.cpp:160] Creating default 
> 'local' authorizer
> 07:00:17 I0207 07:00:17.776511  2970 master.cpp:383] Master 
> fa1554c4-572a-4b89-8994-a89460f588d3 (ip-10-153-254-29.ec2.internal) started 
> on 10.153.254.29:38570
> 07:00:17 I0207 07:00:17.776538  2970 master.cpp:385] Flags at startup: 
> --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/ZROfJk/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/ZROfJk/master" 
> --zk_session_timeout="10secs"
> 07:00:17 I0207 07:00:17.776674  2970 master.cpp:435] Master only allowing 
> authenticated frameworks to register
> 07:00:17 I0207 07:00:17.776687  2970 master.cpp:449] Master only allowing 
> authenticated agents to register
> 07:00:17 I0207 07:00:17.776695  2970 master.cpp:462] Master only allowing 
> authenticated HTTP frameworks to register
> 07:00:17 I0207 07:00:17.776703  2970 credentials.hpp:37] Loading credentials 
> for authentication from '/tmp/ZROfJk/credentials'
> 07:00:17 I0207 07:00:17.776779  2970 master.cpp:507] Using default 'crammd5' 
> authenticator
> 07:00:17 I0207 07:00:17.776841  2970 http.cpp:919] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> 07:00:17 I0207 07:00:17.776919  2970 http.cpp:919] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> 07:00:17 I0207 07:00:17.776970  2970 http.cpp:919] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> 07:00:17 I0207 07:00:17.777009  2970 master.cpp:587] Authorization enabled
> 07:00:17 I0207 07:00:17.777122  2975 hierarchical.cpp:161] Initialized 
> hierarchical allocator process
> 07:00:17 I0207 07:00:17.777138  2974 whitelist_watcher.cpp:77] No whitelist 
> given
> 07:00:17 I0207 07:00:17.04  2976 master.cpp:2123] Elected as the leading 
> master!
> 07:00:17 I0207 07:00:17.26  2976 master.cpp:1645] Recovering from 
> registrar
> 07:00:17 I0207 07:00:17.84  2975 registrar.cpp:329] Recovering registrar
> 07:00:17 I0207 07:00:17.777989  2973 registrar.cpp:362] Successfully fetched 
> the registry (0B) in 176384ns
> 07:00:17 I0207 07:00:17.778023  2973 registrar.cpp:461] Applied 1 operations 
> in 7573ns; attempting to update the registry
> 07:00:17 I0207 07:00:17.778249  2976 registrar.cpp:506] Successfully updated 
> the registry in 210944ns
> 07:00:17 I0207 07:00:17.778290  2976 registrar.cpp:392] Successfully 
> recovered registrar
> 07:00:17 I0207 07:00:17.778373  2976 master.cpp:1761] Recovered 0 agents from 
> the registry (172B); allowing 10mins for agents to re-register
> 07:00:17 I0207 07:00:17.778394  2974 hierarchical.cpp:188] Skipping recovery 
> of hierarchical allocator: nothing to recover
> 07:00:17 I0207 07:00:17.869381  2952 containerizer.cpp:220] Using isolation: 
> posix/cpu,posix/mem,filesystem/posix,network/cni
> 07:00:17 I0207 

[jira] [Issue Comment Deleted] (MESOS-7082) ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0 is flaky.

2017-11-08 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7082:
-
Comment: was deleted

(was: 
https://issues.apache.org/jira/browse/MESOS-7506?focusedCommentId=16243729)

> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0 is 
> flaky.
> -
>
> Key: MESOS-7082
> URL: https://issues.apache.org/jira/browse/MESOS-7082
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
> Environment: ubuntu 16.04 with/without SSL
> Fedora 23
>Reporter: Anand Mazumdar
>Priority: Critical
>  Labels: flaky, flaky-test, mesosphere
>
> Showed up on our internal CI
> {noformat}
> 07:00:17 [ RUN  ] 
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> 07:00:17 I0207 07:00:17.775459  2952 cluster.cpp:160] Creating default 
> 'local' authorizer
> 07:00:17 I0207 07:00:17.776511  2970 master.cpp:383] Master 
> fa1554c4-572a-4b89-8994-a89460f588d3 (ip-10-153-254-29.ec2.internal) started 
> on 10.153.254.29:38570
> 07:00:17 I0207 07:00:17.776538  2970 master.cpp:385] Flags at startup: 
> --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/ZROfJk/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/ZROfJk/master" 
> --zk_session_timeout="10secs"
> 07:00:17 I0207 07:00:17.776674  2970 master.cpp:435] Master only allowing 
> authenticated frameworks to register
> 07:00:17 I0207 07:00:17.776687  2970 master.cpp:449] Master only allowing 
> authenticated agents to register
> 07:00:17 I0207 07:00:17.776695  2970 master.cpp:462] Master only allowing 
> authenticated HTTP frameworks to register
> 07:00:17 I0207 07:00:17.776703  2970 credentials.hpp:37] Loading credentials 
> for authentication from '/tmp/ZROfJk/credentials'
> 07:00:17 I0207 07:00:17.776779  2970 master.cpp:507] Using default 'crammd5' 
> authenticator
> 07:00:17 I0207 07:00:17.776841  2970 http.cpp:919] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> 07:00:17 I0207 07:00:17.776919  2970 http.cpp:919] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> 07:00:17 I0207 07:00:17.776970  2970 http.cpp:919] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> 07:00:17 I0207 07:00:17.777009  2970 master.cpp:587] Authorization enabled
> 07:00:17 I0207 07:00:17.777122  2975 hierarchical.cpp:161] Initialized 
> hierarchical allocator process
> 07:00:17 I0207 07:00:17.777138  2974 whitelist_watcher.cpp:77] No whitelist 
> given
> 07:00:17 I0207 07:00:17.04  2976 master.cpp:2123] Elected as the leading 
> master!
> 07:00:17 I0207 07:00:17.26  2976 master.cpp:1645] Recovering from 
> registrar
> 07:00:17 I0207 07:00:17.84  2975 registrar.cpp:329] Recovering registrar
> 07:00:17 I0207 07:00:17.777989  2973 registrar.cpp:362] Successfully fetched 
> the registry (0B) in 176384ns
> 07:00:17 I0207 07:00:17.778023  2973 registrar.cpp:461] Applied 1 operations 
> in 7573ns; attempting to update the registry
> 07:00:17 I0207 07:00:17.778249  2976 registrar.cpp:506] Successfully updated 
> the registry in 210944ns
> 07:00:17 I0207 07:00:17.778290  2976 registrar.cpp:392] Successfully 
> recovered registrar
> 07:00:17 I0207 07:00:17.778373  2976 master.cpp:1761] Recovered 0 agents from 
> the registry (172B); allowing 10mins for agents to re-register
> 07:00:17 I0207 07:00:17.778394  2974 hierarchical.cpp:188] Skipping recovery 
> of hierarchical allocator: nothing to recover
> 07:00:17 I0207 07:00:17.869381  2952 containerizer.cpp:220] Using isolation: 
> posix/cpu,posix/mem,filesystem/posix,network/cni
> 07:00:17 I0207 

[jira] [Assigned] (MESOS-8172) Agent --authenticate_http_executors commandline flag unrecognized in 1.4.0

2017-11-08 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-8172:


Assignee: Greg Mann

> Agent --authenticate_http_executors commandline flag unrecognized in 1.4.0
> --
>
> Key: MESOS-8172
> URL: https://issues.apache.org/jira/browse/MESOS-8172
> Project: Mesos
>  Issue Type: Bug
>  Components: executor, security
>Affects Versions: 1.4.0
> Environment: Ubuntu 16.04.3 with meso 1.4.0 compiled from source 
> tarball.
>Reporter: Dan Leary
>Assignee: Greg Mann
>
> Apparently the mesos-agent authenticate_http_executors commandline arg was 
> introduced in 1.3.0 by MESOS-6365.   But running "mesos-agent 
> --authenticate_http_executors ..." in 1.4.0 yields
> {noformat}
> Failed to load unknown flag 'authenticate_http_executors'
> {noformat}
> ...followed by a usage report that does not include 
> "--authenticate_http_executors".
> Presumably this means executor authentication is no longer configurable.
> It is still documented at 
> https://mesos.apache.org/documentation/latest/authentication/#agent



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2017-10-30 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225515#comment-16225515
 ] 

Andrei Budnik commented on MESOS-7506:
--

Some tests (from {{SlaveTest}} and {{SlaveRecoveryTest}}) have a pattern [like 
this|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/tests/slave_tests.cpp#L393-L406],
 where the clock is advanced by {{executor_registration_timeout}} and the test 
then waits in a loop until a task status update is sent. This loop executes 
while the container is being destroyed. Container destruction, in turn, consists 
of multiple steps, one of which waits for [cgroups 
destruction|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/slave/containerizer/mesos/linux_launcher.cpp#L567].
 That means we have a race between the container destruction process and the 
loop that advances the clock, with the following possible outcomes:
# The container is completely destroyed before the advancing clock reaches a 
timeout (e.g. {{cgroups::DESTROY_TIMEOUT}}).
# A timeout is triggered by the advancing clock before container destruction 
completes. That results in [leaving 
orphaned|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/slave/containerizer/mesos/containerizer.cpp#L2367-L2380]
 containers that are detected by the [Slave 
destructor|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/tests/cluster.cpp#L559-L584]
 in `tests/cluster.cpp`, so the test fails.

The issue is easily reproduced by advancing the clocks by 60 seconds or more in 
the loop that waits for a status update.

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Issue Comment Deleted] (MESOS-8739) Implement a test to check that a launched container can be killed.

2018-05-14 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-8739:
-
Comment: was deleted

(was: Already covered by `SlaveRecoveryTest.KillTask` and some other 
`SlaveRecoveryTest.*` tests.)

> Implement a test to check that a launched container can be killed.
> --
>
> Key: MESOS-8739
> URL: https://issues.apache.org/jira/browse/MESOS-8739
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Priority: Major
>  Labels: mesosphere, test
>
> This test launches a long-running task, then successively calls `wait()` and 
> `destroy()` methods of the composing containerizer. Both termination statuses 
> must be equal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8739) Implement a test to check that a launched container can be killed.

2018-05-14 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16474384#comment-16474384
 ] 

Andrei Budnik commented on MESOS-8739:
--

This test case is already implicitly covered by `SlaveTest.*`.

> Implement a test to check that a launched container can be killed.
> --
>
> Key: MESOS-8739
> URL: https://issues.apache.org/jira/browse/MESOS-8739
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Priority: Major
>  Labels: mesosphere, test
>
> This test launches a long-running task, then successively calls `wait()` and 
> `destroy()` methods of the composing containerizer. Both termination statuses 
> must be equal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8739) Implement a test to check that a launched container can be killed.

2018-05-14 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456526#comment-16456526
 ] 

Andrei Budnik edited comment on MESOS-8739 at 5/14/18 3:44 PM:
---

Already covered by `SlaveRecoveryTest.KillTask` and some other 
`SlaveRecoveryTest.*` tests.


was (Author: abudnik):
Already covered by `SlaveTest.*`

> Implement a test to check that a launched container can be killed.
> --
>
> Key: MESOS-8739
> URL: https://issues.apache.org/jira/browse/MESOS-8739
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Priority: Major
>  Labels: mesosphere, test
>
> This test launches a long-running task, then successively calls `wait()` and 
> `destroy()` methods of the composing containerizer. Both termination statuses 
> must be equal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8738) Implement a test to check that a recovered container can be killed.

2018-04-27 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456518#comment-16456518
 ] 

Andrei Budnik commented on MESOS-8738:
--

This test case is already covered by `SlaveRecoveryTest.KillTask`.

> Implement a test to check that a recovered container can be killed.
> ---
>
> Key: MESOS-8738
> URL: https://issues.apache.org/jira/browse/MESOS-8738
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Priority: Major
>  Labels: mesosphere, test
>
> This test verifies that a recovered container can be killed via `destroy()` 
> method of composing containerizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8739) Implement a test to check that a launched container can be killed.

2018-04-27 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456526#comment-16456526
 ] 

Andrei Budnik commented on MESOS-8739:
--

Already covered by `SlaveTest.*`

> Implement a test to check that a launched container can be killed.
> --
>
> Key: MESOS-8739
> URL: https://issues.apache.org/jira/browse/MESOS-8739
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Priority: Major
>  Labels: mesosphere, test
>
> This test launches a long-running task, then successively calls `wait()` and 
> `destroy()` methods of the composing containerizer. Both termination statuses 
> must be equal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8884) Flaky `DockerContainerizerTest.ROOT_DOCKER_MaxCompletionTime`.

2018-05-08 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467476#comment-16467476
 ] 

Andrei Budnik commented on MESOS-8884:
--

[~zhitao] Thanks for the patch!

> Flaky `DockerContainerizerTest.ROOT_DOCKER_MaxCompletionTime`.
> --
>
> Key: MESOS-8884
> URL: https://issues.apache.org/jira/browse/MESOS-8884
> Project: Mesos
>  Issue Type: Bug
> Environment: master-520b7298
>Reporter: Andrei Budnik
>Assignee: Zhitao Li
>Priority: Major
>  Labels: flaky-test
> Attachments: ROOT_DOCKER_MaxCompletionTime-badrun.txt
>
>
> This test fails quite often in our internal CI.
> {code:java}
> ../../src/tests/containerizer/docker_containerizer_tests.cpp:663: Failure
> termination.get() is NONE
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8884) Flaky `DockerContainerizerTest.ROOT_DOCKER_MaxCompletionTime`.

2018-05-04 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8884:


 Summary: Flaky 
`DockerContainerizerTest.ROOT_DOCKER_MaxCompletionTime`.
 Key: MESOS-8884
 URL: https://issues.apache.org/jira/browse/MESOS-8884
 Project: Mesos
  Issue Type: Bug
 Environment: master-520b7298
Reporter: Andrei Budnik
 Attachments: ROOT_DOCKER_MaxCompletionTime-badrun.txt

This test fails quite often in our internal CI.
{code:java}
../../src/tests/containerizer/docker_containerizer_tests.cpp:663: Failure
termination.get() is NONE
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8736) Implement a test which ensures that `wait` and `destroy` return the same result for a terminated nested container.

2018-05-15 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-8736:


Assignee: Andrei Budnik

> Implement a test which ensures that `wait` and `destroy` return the same 
> result for a terminated nested container.
> --
>
> Key: MESOS-8736
> URL: https://issues.apache.org/jira/browse/MESOS-8736
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: mesosphere, test
>
> This test launches a nested container using a composing containerizer, then 
> checks that calling `destroy()` after `wait()` returns the same non-empty 
> container termination status as for `wait()`. After that, it kills the parent 
> container and checks that both `destroy()` and `wait()` return an empty 
> termination status.
> Note that this test uses only the Composing c'zer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8134) SlaveTest.ContainersEndpoint is flaky due to getenv crash.

2018-05-17 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479276#comment-16479276
 ] 

Andrei Budnik commented on MESOS-8134:
--

Steps to reproduce the race condition:
 1. Wrap [`os::getenv()` 
code|https://github.com/apache/mesos/blob/40b40d9b73221388e583fc140280f1eb2b48b832/src/slave/slave.cpp#L9948-L9951]
 with a loop `for (int i=0; i<1000 * 1000; ++i) {`
 2. Run {{./src/mesos-tests --gtest_filter=SlaveTest.ContainersEndpoint 
--gtest_break_on_failure --gtest_repeat=100 --verbose}}
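
For reference, the underlying data race can also be shown with a small 
self-contained program, independent of Mesos (hypothetical demo, not part of the 
test suite; compile with -pthread):
{code}
#include <cstdlib>
#include <string>
#include <thread>

// Demonstrates the getenv()/setenv() race: mutating the global environment
// while another thread reads it is undefined behavior and can crash, which is
// what the looped os::getenv() call above makes much more likely to hit.
int main()
{
  ::setenv("LIBPROCESS_IP", "127.0.0.1", 1);

  // Reader: hammers getenv(), similar to the wrapped os::getenv() loop above.
  std::thread reader([]() {
    for (int i = 0; i < 1000 * 1000; ++i) {
      const char* value = ::getenv("LIBPROCESS_IP");
      if (value != nullptr) {
        std::string copy(value);  // May read memory freed by the writer.
      }
    }
  });

  // Writer: keeps mutating the environment from another thread.
  std::thread writer([]() {
    for (int i = 0; i < 1000 * 1000; ++i) {
      ::setenv("SOME_OTHER_VAR", std::to_string(i).c_str(), 1);
    }
  });

  reader.join();
  writer.join();
  return 0;
}
{code}
Run under a thread sanitizer or for enough iterations, this typically crashes 
inside getenv(), mirroring the getenv-during-setenv crash described in this 
ticket.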

> SlaveTest.ContainersEndpoint is flaky due to getenv crash.
> --
>
> Key: MESOS-8134
> URL: https://issues.apache.org/jira/browse/MESOS-8134
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Mahler
>Priority: Major
>  Labels: flaky-test
>
> Looks like this test also has the getenv during setenv crash:
> {noformat}
> [ RUN  ] SlaveTest.ContainersEndpoint
> I1025 04:02:53.061488  6805 cluster.cpp:162] Creating default 'local' 
> authorizer
> I1025 04:02:53.065587  6824 master.cpp:445] Master 
> 2dc7ad46-f111-4762-9bf6-ef428a6f6d53 (a4020869f68c) started on 
> 172.17.0.2:38626
> I1025 04:02:53.065665  6824 master.cpp:447] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/lq9Ngb/credentials" 
> --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-1.5.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/lq9Ngb/master" --zk_session_timeout="10secs"
> I1025 04:02:53.066131  6824 master.cpp:496] Master only allowing 
> authenticated frameworks to register
> I1025 04:02:53.066145  6824 master.cpp:502] Master only allowing 
> authenticated agents to register
> I1025 04:02:53.066153  6824 master.cpp:508] Master only allowing 
> authenticated HTTP frameworks to register
> I1025 04:02:53.066165  6824 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/lq9Ngb/credentials'
> I1025 04:02:53.066561  6824 master.cpp:552] Using default 'crammd5' 
> authenticator
> I1025 04:02:53.066746  6824 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1025 04:02:53.066949  6824 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1025 04:02:53.067095  6824 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1025 04:02:53.067230  6824 master.cpp:631] Authorization enabled
> I1025 04:02:53.067430  6818 hierarchical.cpp:171] Initialized hierarchical 
> allocator process
> I1025 04:02:53.067477  6807 whitelist_watcher.cpp:77] No whitelist given
> I1025 04:02:53.070369  6825 master.cpp:2198] Elected as the leading master!
> I1025 04:02:53.070421  6825 master.cpp:1687] Recovering from registrar
> I1025 04:02:53.070796  6816 registrar.cpp:347] Recovering registrar
> I1025 04:02:53.071532  6816 registrar.cpp:391] Successfully fetched the 
> registry (0B) in 690944ns
> I1025 04:02:53.071671  6816 registrar.cpp:495] Applied 1 operations in 
> 54316ns; attempting to update the registry
> I1025 04:02:53.072278  6816 registrar.cpp:552] Successfully updated the 
> registry in 538880ns
> I1025 04:02:53.072394  6816 registrar.cpp:424] Successfully recovered 
> registrar
> I1025 04:02:53.072808  6823 master.cpp:1791] Recovered 0 agents from the 
> registry (129B); allowing 10mins for agents to re-register
> I1025 04:02:53.072983  6828 hierarchical.cpp:209] Skipping recovery of 
> hierarchical allocator: nothing to recover
> W1025 04:02:53.077972  6805 process.cpp:3193] Attempted to spawn already 
> running process files@172.17.0.2:38626
> I1025 04:02:53.078305  6805 cluster.cpp:448] Creating default 'local' 

[jira] [Assigned] (MESOS-8740) Update description of a Containerizer interface.

2018-05-15 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-8740:


Assignee: Andrei Budnik

> Update description of a Containerizer interface.
> 
>
> Key: MESOS-8740
> URL: https://issues.apache.org/jira/browse/MESOS-8740
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: documentaion, mesosphere
>
> [Containerizer 
> interface|https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.hpp]
>  must be updated with respect to the latest changes. In addition, it should 
> clearly describe semantics of `wait()` and `destroy()` methods, including 
> cases with a nested containers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8829) Get rid of extra `containerizer->wait()` calls in tests.

2018-05-15 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-8829:


Assignee: Andrei Budnik

> Get rid of extra `containerizer->wait()` calls in tests.
> 
>
> Key: MESOS-8829
> URL: https://issues.apache.org/jira/browse/MESOS-8829
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>
> Since both `wait()` and `destroy()` return the same result, we can get rid of 
> the extra `containerizer->wait()` calls in tests, e.g. 
> [here|https://github.com/apache/mesos/blob/c662048ae365630e3249b51102c9f7f962cc24d3/src/tests/slave_recovery_tests.cpp#L2292-L2300]
>  and 
> [there|https://github.com/apache/mesos/blob/c662048ae365630e3249b51102c9f7f962cc24d3/src/tests/cluster.cpp#L654-L668]
>  as well as in some other places.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8738) Implement a test to check that a recovered container can be killed.

2018-05-15 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456518#comment-16456518
 ] 

Andrei Budnik edited comment on MESOS-8738 at 5/15/18 4:16 PM:
---

This test case is already implicitly covered by `SlaveRecoveryTest.KillTask`, 
because we always use a composing containerizer in `StartSlave()`, see 
MESOS-8732.


was (Author: abudnik):
This test case is already covered by `SlaveRecoveryTest.KillTask`.

> Implement a test to check that a recovered container can be killed.
> ---
>
> Key: MESOS-8738
> URL: https://issues.apache.org/jira/browse/MESOS-8738
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Priority: Major
>  Labels: mesosphere, test
>
> This test verifies that a recovered container can be killed via `destroy()` 
> method of composing containerizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8739) Implement a test to check that a launched container can be killed.

2018-05-15 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16474384#comment-16474384
 ] 

Andrei Budnik edited comment on MESOS-8739 at 5/15/18 4:14 PM:
---

This test case is already implicitly covered by `SlaveTest.*`, because we 
always use a composing containerizer in `StartSlave()`, see MESOS-8732.


was (Author: abudnik):
This test case is already implicitly covered by `SlaveTest.*`.

> Implement a test to check that a launched container can be killed.
> --
>
> Key: MESOS-8739
> URL: https://issues.apache.org/jira/browse/MESOS-8739
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Priority: Major
>  Labels: mesosphere, test
>
> This test launches a long-running task, then successively calls `wait()` and 
> `destroy()` methods of the composing containerizer. Both termination statuses 
> must be equal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8738) Implement a test to check that a recovered container can be killed.

2018-05-15 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-8738:


Assignee: Andrei Budnik

> Implement a test to check that a recovered container can be killed.
> ---
>
> Key: MESOS-8738
> URL: https://issues.apache.org/jira/browse/MESOS-8738
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: mesosphere, test
>
> This test verifies that a recovered container can be killed via `destroy()` 
> method of composing containerizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8828) Clock::advance can race with process::delay in tests.

2018-05-17 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479319#comment-16479319
 ] 

Andrei Budnik commented on MESOS-8828:
--

Another possible solution is to introduce a `FUTURE_DELAY(M)` primitive that 
returns a future which becomes ready when `delay(duration, pid, M)` is called. 
This primitive would be similar in spirit to `FUTURE_DISPATCH()`; a usage 
sketch follows.
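
A rough usage sketch of such a hypothetical primitive (nothing below exists yet; 
names like `authenticationTimeout` are illustrative):
{code}
// Hypothetical: the returned future becomes ready once the agent has actually
// scheduled the delayed Slave::authenticate() call via process::delay().
Future<Nothing> authenticating =
  FUTURE_DELAY(slave.get()->pid, &Slave::authenticate);

Clock::pause();
// ... start master and agent ...

// Only advance the clock after the delay() has been registered, so the
// delayed call cannot be lost behind the paused clock.
AWAIT_READY(authenticating);
Clock::advance(authenticationTimeout);
Clock::settle();
{code}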

> Clock::advance can race with process::delay in tests.
> -
>
> Key: MESOS-8828
> URL: https://issues.apache.org/jira/browse/MESOS-8828
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Budnik
>Priority: Major
>  Labels: flaky
> Attachments: failed_tests.txt
>
>
> There are lots of tests that use the following pattern:
>  1) [Pause 
> clocks|https://github.com/apache/mesos/blob/c662048ae365630e3249b51102c9f7f962cc24d3/src/tests/persistent_volume_tests.cpp#L1108]
>  2) [Start an 
> agent|https://github.com/apache/mesos/blob/c662048ae365630e3249b51102c9f7f962cc24d3/src/tests/persistent_volume_tests.cpp#L1122]
>  3) [Advance clocks to trigger an 
> event|https://github.com/apache/mesos/blob/c662048ae365630e3249b51102c9f7f962cc24d3/src/tests/persistent_volume_tests.cpp#L1125]
>  4) [Wait for the 
> event|https://github.com/apache/mesos/blob/c662048ae365630e3249b51102c9f7f962cc24d3/src/tests/persistent_volume_tests.cpp#L1127]
> If an event is scheduled via `process::delay()` after advancing the clocks, 
> then a test hangs in the endless wait for the event that is never triggered, 
> because libprocess clocks are paused. For example, 
> `DiskResource/PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy/0` 
> test hangs at step 4, because the clocks at step 3 has been already advanced 
> before the agent scheduled a call of 
> [Slave::authenticate()|https://github.com/apache/mesos/blob/ebe92c9b39933136968e4ba3a52527e52b361d22/src/slave/slave.cpp#L1301]
>  method. After a successful authentication with a master, the agent sends a 
> [UpdateSlaveMessage|https://github.com/apache/mesos/blob/ebe92c9b39933136968e4ba3a52527e52b361d22/src/slave/slave.cpp#L1546-L1550].
>  But the authentication process never finishes because 
> `[Slave::authenticate()|https://github.com/apache/mesos/blob/ebe92c9b39933136968e4ba3a52527e52b361d22/src/slave/slave.cpp#L1301]`
>  is never called.
> A list of tests that might be affected by the issue attached to this ticket.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-3475) TestContainerizer should not modify global environment variables

2018-05-24 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-3475:


Assignee: Andrei Budnik

> TestContainerizer should not modify global environment variables
> 
>
> Key: MESOS-3475
> URL: https://issues.apache.org/jira/browse/MESOS-3475
> Project: Mesos
>  Issue Type: Bug
>Reporter: Joris Van Remoortere
>Assignee: Andrei Budnik
>Priority: Major
>
> Currently the {{TestContainerizer}} modifies the environment variables. Since 
> these are global variables, this can cause other threads reading these 
> variables to get inconsistent results, or even segfault if they happen to 
> read while the environment is being changed.
> Synchronizing within the TestContainerizer is not sufficient. We should pass 
> the environment variables into a fork, or set them on the command line of an 
> execute.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8951) Flaky `AgentContainerAPITest.RecoverNestedContainer`

2018-05-24 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8951:


 Summary: Flaky `AgentContainerAPITest.RecoverNestedContainer`
 Key: MESOS-8951
 URL: https://issues.apache.org/jira/browse/MESOS-8951
 Project: Mesos
  Issue Type: Bug
 Environment: internal CI
 master-668030da
Reporter: Andrei Budnik
 Attachments: AgentContainerAPITest.RecoverNestedContainer-badrun1.txt, 
AgentContainerAPITest.RecoverNestedContainer-badrun2.txt

{code:java}
[  FAILED  ] 
ParentChildContainerTypeAndContentType/AgentContainerAPITest.RecoverNestedContainer/9,
 where GetParam() = (1, 0, application/json, 
("cgroups/cpu,cgroups/mem,filesystem/linux,namespaces/pid", "linux", 
"ROOT_CGROUPS_")) (15297 ms)
[  FAILED  ] 
ParentChildContainerTypeAndContentType/AgentContainerAPITest.RecoverNestedContainer/13,
 where GetParam() = (1, 1, application/json, 
("cgroups/cpu,cgroups/mem,filesystem/linux,namespaces/pid", "linux", 
"ROOT_CGROUPS_")) (15275 ms){code}
{code:java}
../../src/tests/agent_container_api_tests.cpp:596
Failed to wait 15secs for wait
{code}
There is no `WAIT_CONTAINER` call in the agent logs. It looks like the request 
wasn't delivered to the agent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8739) Implement a test to check that a launched container can be killed.

2018-05-15 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-8739:


Assignee: Andrei Budnik

> Implement a test to check that a launched container can be killed.
> --
>
> Key: MESOS-8739
> URL: https://issues.apache.org/jira/browse/MESOS-8739
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: mesosphere, test
>
> This test launches a long-running task, then successively calls `wait()` and 
> `destroy()` methods of the composing containerizer. Both termination statuses 
> must be equal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9029) Seccomp syscall filtering in Mesos containerizer

2018-06-26 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-9029:


 Summary: Seccomp syscall filtering in Mesos containerizer
 Key: MESOS-9029
 URL: https://issues.apache.org/jira/browse/MESOS-9029
 Project: Mesos
  Issue Type: Epic
  Components: containerization
Reporter: Andrei Budnik
Assignee: Andrei Budnik


This epic is meant to collect all the tickets related to the implementation of 
Seccomp syscall filtering on the Mesos agent via the `linux/seccomp` isolator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9024) Mesos master segfaults with stack overflow under load

2018-06-23 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16521101#comment-16521101
 ] 

Andrei Budnik commented on MESOS-9024:
--

Could you please add the repeating part of the stack trace to the description?

> Mesos master segfaults with stack overflow under load
> -
>
> Key: MESOS-9024
> URL: https://issues.apache.org/jira/browse/MESOS-9024
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, master
>Affects Versions: 1.6.0
> Environment: Ubuntu 16.04.4 
>Reporter: Andrew Ruef
>Priority: Major
> Attachments: stack.txt.gz
>
>
> Running mesos in non-HA mode on a small cluster under load, the master 
> reliably segfaults due to some state it has worked itself into. The segfault 
> appears to be a stack overflow, at least, the call stack has 72662 elements 
> in it in the crashing thread. The root of the stack appears to be in 
> libprocess. 
> I've attached a gzip compressed stack backtrace since the uncompressed stack 
> backtrace is too large to attach to this issue. This happens to me fairly 
> reliably when doing jobs, but it can take many hours or days for mesos to 
> work itself back into this state. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8734) Restore `WaitAfterDestroy` test to check termination status of a terminated nested container.

2018-04-26 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-8734:


Assignee: Andrei Budnik

> Restore `WaitAfterDestroy` test to check termination status of a terminated 
> nested container.
> -
>
> Key: MESOS-8734
> URL: https://issues.apache.org/jira/browse/MESOS-8734
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: mesosphere, test
>
> It's important to check that after termination of a nested container, its 
> termination status is available. This property is used by the default executor.
> Note that the test uses the Mesos c'zer and checks the above-mentioned property 
> only for the Mesos c'zer.
> Right now, if we remove [this section of 
> code|https://github.com/apache/mesos/blob/5b655ce062ff55cdefed119d97ad923aeeb2efb5/src/slave/containerizer/mesos/containerizer.cpp#L2093-L2111],
>  no test will be broken!
> https://reviews.apache.org/r/65505



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6285) Agents may OOM during recovery if there are too many tasks or executors

2018-04-30 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458823#comment-16458823
 ] 

Andrei Budnik commented on MESOS-6285:
--

Introducing a limit on the number of stored tasks per executor and/or framework 
in the garbage collector could solve the issue; a rough illustration follows.
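
A minimal illustration of one way to bound this (a sketch only; the constant and 
types below are hypothetical, not existing Mesos flags or classes):
{code}
#include <cstddef>
#include <string>

#include <boost/circular_buffer.hpp>

// Hypothetical cap on how many completed tasks are kept per executor.
constexpr std::size_t MAX_COMPLETED_TASKS_PER_EXECUTOR = 1000;

struct CompletedTask
{
  std::string taskId;
  // status, timestamps, ...
};

struct ExecutorState
{
  ExecutorState()
    : completedTasks(MAX_COMPLETED_TASKS_PER_EXECUTOR) {}

  // A bounded buffer drops the oldest completed tasks instead of keeping all
  // of them (e.g. ~400k in the scenario above) in memory during recovery.
  boost::circular_buffer<CompletedTask> completedTasks;
};
{code}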

> Agents may OOM during recovery if there are too many tasks or executors
> ---
>
> Key: MESOS-6285
> URL: https://issues.apache.org/jira/browse/MESOS-6285
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
>Reporter: Joseph Wu
>Priority: Major
>  Labels: mesosphere
>
> On a test cluster, we encountered a degenerate case where running the 
> example {{long-lived-framework}} for over a week would render the agent 
> unrecoverable.
> The {{long-lived-framework}} creates one custom {{long-lived-executor}} and 
> launches a single task on that executor every time it receives an offer from 
> that agent.  Over a week's worth of time, the framework manages to launch 
> some 400k tasks (short sleeps) on one executor.  During runtime, this is not 
> problematic, as each completed task is quickly rotated out of the agent's 
> memory (and checkpointed to disk).
> During recovery, however, the agent reads every single task into memory, 
> which leads to slow recovery and often results in the agent being OOM-killed 
> before it finishes recovering.
> To repro this condition quickly:
> 1) Apply this patch to the {{long-lived-framework}}:
> {code}
> diff --git a/src/examples/long_lived_framework.cpp 
> b/src/examples/long_lived_framework.cpp
> index 7c57eb5..1263d82 100644
> --- a/src/examples/long_lived_framework.cpp
> +++ b/src/examples/long_lived_framework.cpp
> @@ -358,16 +358,6 @@ private:
>// Helper to launch a task using an offer.
>void launch(const Offer& offer)
>{
> -int taskId = tasksLaunched++;
> -++metrics.tasks_launched;
> -
> -TaskInfo task;
> -task.set_name("Task " + stringify(taskId));
> -task.mutable_task_id()->set_value(stringify(taskId));
> -task.mutable_agent_id()->MergeFrom(offer.agent_id());
> -task.mutable_resources()->CopyFrom(taskResources);
> -task.mutable_executor()->CopyFrom(executor);
> -
>  Call call;
>  call.set_type(Call::ACCEPT);
>  
> @@ -380,7 +370,23 @@ private:
>  Offer::Operation* operation = accept->add_operations();
>  operation->set_type(Offer::Operation::LAUNCH);
>  
> -operation->mutable_launch()->add_task_infos()->CopyFrom(task);
> +// Launch as many tasks as possible in the given offer.
> +Resources remaining = Resources(offer.resources()).flatten();
> +while (remaining.contains(taskResources)) {
> +  int taskId = tasksLaunched++;
> +  ++metrics.tasks_launched;
> +
> +  TaskInfo task;
> +  task.set_name("Task " + stringify(taskId));
> +  task.mutable_task_id()->set_value(stringify(taskId));
> +  task.mutable_agent_id()->MergeFrom(offer.agent_id());
> +  task.mutable_resources()->CopyFrom(taskResources);
> +  task.mutable_executor()->CopyFrom(executor);
> +
> +  operation->mutable_launch()->add_task_infos()->CopyFrom(task);
> +
> +  remaining -= taskResources;
> +}
>  
>  mesos->send(call);
>}
> {code}
> 2) Run a master, agent, and {{long-lived-framework}}.  On a 1 CPU, 1 GB agent 
> + this patch, it should take about 10 minutes to build up sufficient task 
> launches.
> 3) Restart the agent and watch it flail during recovery.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8871) Agent may fail to recover if the agent dies before image store cache checkpointed.

2018-05-03 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462355#comment-16462355
 ] 

Andrei Budnik commented on MESOS-8871:
--

This issue has been reproduced once in our internal testing cluster. It was 
fixed by removing the '/var/lib/mesos/slave/store/docker/storedImages' file.
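
The empty-file symptom is consistent with a non-atomic checkpoint of the image 
store metadata. For illustration only, here is a minimal sketch of the usual 
write-to-temporary-file-then-rename pattern that avoids it, using plain POSIX 
calls rather than the actual Mesos checkpointing helpers 
({{checkpointAtomically}} is a hypothetical name):
{code}
// Hypothetical sketch (not the actual Mesos checkpointing code): write
// the new contents to a temporary file, flush it to disk, and only then
// atomically rename it over the real path, so a crash can never leave
// an empty or half-written 'storedImages' file behind.
#include <cstdio>
#include <fcntl.h>
#include <string>
#include <unistd.h>

bool checkpointAtomically(const std::string& path, const std::string& data)
{
  const std::string temp = path + ".tmp";

  int fd = ::open(temp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) {
    return false;
  }

  bool ok = ::write(fd, data.data(), data.size()) ==
            static_cast<ssize_t>(data.size());

  ok = ok && ::fsync(fd) == 0;   // Make sure the bytes reach the disk.
  ok = (::close(fd) == 0) && ok;

  // rename(2) is atomic: readers see either the old file or the new one.
  ok = ok && ::rename(temp.c_str(), path.c_str()) == 0;

  if (!ok) {
    ::unlink(temp.c_str());
  }
  return ok;
}
{code}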

> Agent may fail to recover if the agent dies before image store cache 
> checkpointed.
> --
>
> Key: MESOS-8871
> URL: https://issues.apache.org/jira/browse/MESOS-8871
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Gilbert Song
>Priority: Major
>  Labels: mesosphere, slave
>
> {noformat}
> E0502 13:51:45.398555 10100 slave.cpp:7305] EXIT with status 1: Failed to 
> perform recovery: Collect failed: Collect failed: Collect failed: Unexpected 
> empty images file '/var/lib/mesos/slave/store/docker/storedImages'
> {noformat}
> This may happen if the agent dies after the file is created but before the 
> contents are persisted on disk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-7506) Multiple tests leave orphan containers.

2017-10-20 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212605#comment-16212605
 ] 

Andrei Budnik edited comment on MESOS-7506 at 10/20/17 6:28 PM:


Bug has been reproduced with extra debug logs 
(SlaveTest.ShutdownUnregisteredExecutor):
{code}
I1020 12:07:20.266032  9274 containerizer.cpp:2220] Destroying container 
7f9cb5a6-26c9-4010-ace9-b9cb3e065542 in RUNNING state
I1020 12:07:20.266042  9274 containerizer.cpp:2784] Transitioning the state of 
container 7f9cb5a6-26c9-4010-ace9-b9cb3e065542 from RUNNING to DESTROYING
I1020 12:07:20.266175  9274 linux_launcher.cpp:514] Asked to destroy container 
7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.266717  9274 linux_launcher.cpp:560] Using freezer to destroy 
cgroup mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.268649  9274 cgroups.cpp:1562] TasksKiller::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.268756  9274 cgroups.cpp:3083] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.269533  9276 cgroups.cpp:1397] Freezer::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.270486  9276 cgroups.cpp:1422] Freezer::freeze 2: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542: FREEZING
I1020 12:07:20.270725  9272 cgroups.cpp:1397] Freezer::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.271625  9272 cgroups.cpp:1415] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542 after 1secs
I1020 12:07:20.271724  9272 hierarchical.cpp:1488] Performed allocation for 1 
agents in 18541ns
I1020 12:07:20.271767  9272 cgroups.cpp:1573] TasksKiller::kill: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.273386  9272 cgroups.cpp:1596] TasksKiller::thaw: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.273486  9272 cgroups.cpp:3101] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.274129  9272 cgroups.cpp:1431] Freezer::thaw: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.276964  9272 cgroups.cpp:1448] Successfully thawed cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542 after 0ns
I1020 12:07:20.277225  9277 cgroups.cpp:1602] TasksKiller::reap: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.277613  9279 hierarchical.cpp:1488] Performed allocation for 1 
agents in 17680ns
I1020 12:07:20.22  9279 containerizer.cpp:2671] Container 
7f9cb5a6-26c9-4010-ace9-b9cb3e065542 has exited
{code}
{{TasksKiller::finished}} wasn't called, while {{TasksKiller::reap}} was called.


was (Author: abudnik):
Bug has been reproduced with extra debug logs 
(SlaveTest.ShutdownUnregisteredExecutor):
{code}
I1020 12:07:20.266032  9274 containerizer.cpp:2220] Destroying container 
7f9cb5a6-26c9-4010-ace9-b9cb3e065542 in RUNNING state
I1020 12:07:20.266042  9274 containerizer.cpp:2784] Transitioning the state of 
container 7f9cb5a6-26c9-4010-ace9-b9cb3e065542 from RUNNING to DESTROYING
I1020 12:07:20.266175  9274 linux_launcher.cpp:514] Asked to destroy container 
7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.266717  9274 linux_launcher.cpp:560] Using freezer to destroy 
cgroup mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.268649  9274 cgroups.cpp:1562] TasksKiller::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.268756  9274 cgroups.cpp:3083] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.269533  9276 cgroups.cpp:1397] Freezer::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.270486  9276 cgroups.cpp:1422] Freezer::freeze 2: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542: FREEZING
I1020 12:07:20.270725  9272 cgroups.cpp:1397] Freezer::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.271625  9272 cgroups.cpp:1415] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542 after 1secs
I1020 12:07:20.271724  9272 hierarchical.cpp:1488] Performed allocation for 1 
agents in 18541ns
I1020 12:07:20.271767  9272 cgroups.cpp:1573] TasksKiller::kill: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.273386  9272 cgroups.cpp:1596] TasksKiller::thaw: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.273486  9272 cgroups.cpp:3101] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.274129  9272 cgroups.cpp:1431] Freezer::thaw: 

[jira] [Comment Edited] (MESOS-7506) Multiple tests leave orphan containers.

2017-10-20 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212605#comment-16212605
 ] 

Andrei Budnik edited comment on MESOS-7506 at 10/20/17 6:30 PM:


Bug has been reproduced with extra debug logs 
(SlaveTest.ShutdownUnregisteredExecutor):
{code}
I1020 17:59:05.049862 16817 cgroups.cpp:1563] TasksKiller::freeze: 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7
I1020 17:59:05.049876 16817 cgroups.cpp:3085] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7
I1020 17:59:05.050351 16817 cgroups.cpp:1398] Freezer::freeze: 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7
I1020 17:59:05.051440 16817 cgroups.cpp:1423] Freezer::freeze 2: 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7: FREEZING
I1020 17:59:05.051749 16819 cgroups.cpp:1398] Freezer::freeze: 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7
I1020 17:59:05.052760 16819 cgroups.cpp:1416] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7 after 1secs
I1020 17:59:05.052858 16819 hierarchical.cpp:1488] Performed allocation for 1 
agents in 15715ns
I1020 17:59:05.052901 16819 cgroups.cpp:1574] TasksKiller::kill: 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7
I1020 17:59:05.053357 16819 cgroups.cpp:1584] TasksKiller::kill: reap: 31229
I1020 17:59:05.053377 16819 cgroups.cpp:1584] TasksKiller::kill: reap: 31243
I1020 17:59:05.054193 16819 cgroups.cpp:928] cgroups::kill: 31229
I1020 17:59:05.054206 16819 cgroups.cpp:928] cgroups::kill: 31243
I1020 17:59:05.054262 16819 cgroups.cpp:1598] TasksKiller::thaw: 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7
I1020 17:59:05.054272 16819 cgroups.cpp:3103] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7
I1020 17:59:05.054757 16819 cgroups.cpp:1432] Freezer::thaw: 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7
I1020 17:59:05.057647 16819 cgroups.cpp:1449] Successfully thawed cgroup 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7 after 0ns
I1020 17:59:05.057842 16816 cgroups.cpp:1604] TasksKiller::reap: 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7
{code}
{{TasksKiller::finished}} wasn't called, while {{TasksKiller::reap}} was called.


was (Author: abudnik):
Bug has been reproduced with extra debug logs 
(SlaveTest.ShutdownUnregisteredExecutor):
{code}
I1020 12:07:20.266032  9274 containerizer.cpp:2220] Destroying container 
7f9cb5a6-26c9-4010-ace9-b9cb3e065542 in RUNNING state
I1020 12:07:20.266042  9274 containerizer.cpp:2784] Transitioning the state of 
container 7f9cb5a6-26c9-4010-ace9-b9cb3e065542 from RUNNING to DESTROYING
I1020 12:07:20.266175  9274 linux_launcher.cpp:514] Asked to destroy container 
7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.266717  9274 linux_launcher.cpp:560] Using freezer to destroy 
cgroup mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.268649  9274 cgroups.cpp:1562] TasksKiller::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.268756  9274 cgroups.cpp:3083] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.269533  9276 cgroups.cpp:1397] Freezer::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.270486  9276 cgroups.cpp:1422] Freezer::freeze 2: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542: FREEZING
I1020 12:07:20.270725  9272 cgroups.cpp:1397] Freezer::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.271625  9272 cgroups.cpp:1415] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542 after 1secs
I1020 12:07:20.271724  9272 hierarchical.cpp:1488] Performed allocation for 1 
agents in 18541ns
I1020 12:07:20.271767  9272 cgroups.cpp:1573] TasksKiller::kill: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.273386  9272 cgroups.cpp:1596] TasksKiller::thaw: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.273486  9272 cgroups.cpp:3101] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.274129  9272 cgroups.cpp:1431] Freezer::thaw: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.276964  9272 cgroups.cpp:1448] Successfully thawed cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542 after 0ns
I1020 12:07:20.277225  9277 cgroups.cpp:1602] TasksKiller::reap: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.277613  9279 hierarchical.cpp:1488] Performed allocation for 1 
agents in 17680ns
I1020 12:07:20.22  9279 containerizer.cpp:2671] Container 

[jira] [Assigned] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-08 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-7742:


Assignee: Andrei Budnik

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>  Labels: flaky-test, mesosphere-oncall
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes at least in three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8391) Mesos agent doesn't notice that a pod task exits or crashes after the agent restart

2018-01-10 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-8391:


Assignee: Andrei Budnik  (was: Gilbert Song)

> Mesos agent doesn't notice that a pod task exits or crashes after the agent 
> restart
> ---
>
> Key: MESOS-8391
> URL: https://issues.apache.org/jira/browse/MESOS-8391
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, executor
>Affects Versions: 1.5.0
>Reporter: Ivan Chernetsky
>Assignee: Andrei Budnik
>Priority: Blocker
> Attachments: testing-log-2.tar.gz
>
>
> h4. (1) Agent doesn't detect that a pod task exits/crashes
> # Create a Marathon pod with two containers which just do {{sleep 1}}.
> # Restart the Mesos agent on the node the pod got launched.
> # Kill one of the pod tasks
> *Expected result*: The Mesos agent detects that one of the tasks got killed, 
> and forwards {{TASK_FAILED}} status to Marathon.
> *Actual result*: The Mesos agent does nothing, and the Mesos master thinks 
> that both tasks are running just fine. Marathon doesn't take any action 
> because it doesn't receive any update from Mesos.
> h4. (2) After the agent restart, it detects that the task crashed, forwards 
> the correct status update, but the other task stays in {{TASK_KILLING}} state 
> forever
> # Perform steps in (1).
> # Restart the Mesos agent
> *Expected result*: The Mesos agent detects that one of the tasks crashed, 
> forwards the corresponding status update, and kills the other task too.
> *Actual result*: The Mesos agent detects that one of the tasks crashed, 
> forwards the corresponding status update, but the other task stays in 
> `TASK_KILLING` state forever.
> Please note that after another agent restart, the other task finally gets 
> killed and the correct status updates get propagated all the way to Marathon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8391) Mesos agent doesn't notice that a pod task exits or crashes after the agent restart

2018-01-10 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320409#comment-16320409
 ] 

Andrei Budnik edited comment on MESOS-8391 at 1/10/18 6:47 PM:
---

https://reviews.apache.org/r/65071/
https://reviews.apache.org/r/65077/


was (Author: abudnik):
https://reviews.apache.org/r/65071/

> Mesos agent doesn't notice that a pod task exits or crashes after the agent 
> restart
> ---
>
> Key: MESOS-8391
> URL: https://issues.apache.org/jira/browse/MESOS-8391
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, executor
>Affects Versions: 1.5.0
>Reporter: Ivan Chernetsky
>Assignee: Andrei Budnik
>Priority: Blocker
> Attachments: testing-log-2.tar.gz
>
>
> h4. (1) Agent doesn't detect that a pod task exits/crashes
> # Create a Marathon pod with two containers which just do {{sleep 1}}.
> # Restart the Mesos agent on the node the pod got launched.
> # Kill one of the pod tasks
> *Expected result*: The Mesos agent detects that one of the tasks got killed, 
> and forwards {{TASK_FAILED}} status to Marathon.
> *Actual result*: The Mesos agent does nothing, and the Mesos master thinks 
> that both tasks are running just fine. Marathon doesn't take any action 
> because it doesn't receive any update from Mesos.
> h4. (2) After the agent restart, it detects that the task crashed, forwards 
> the correct status update, but the other task stays in {{TASK_KILLING}} state 
> forever
> # Perform steps in (1).
> # Restart the Mesos agent
> *Expected result*: The Mesos agent detects that one of the tasks crashed, 
> forwards the corresponding status update, and kills the other task too.
> *Actual result*: The Mesos agent detects that one of the tasks crashed, 
> forwards the corresponding status update, but the other task stays in 
> `TASK_KILLING` state forever.
> Please note that after another agent restart, the other task finally gets 
> killed and the correct status updates get propagated all the way to Marathon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-10 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7742:
-
Sprint: Mesosphere Sprint 58, Mesosphere Sprint 72  (was: Mesosphere Sprint 
58)

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>  Labels: flaky-test, mesosphere-oncall
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes at least in three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7506) Multiple tests leave orphan containers.

2018-01-09 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7506:
-
Attachment: ROOT_IsolatorFlags-badrun2.txt

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ROOT_IsolatorFlags-badrun.txt, ROOT_IsolatorFlags-badrun2.txt, 
> ReconcileTasksMissingFromSlave-badrun.txt, ResourceLimitation-badrun.txt, 
> ResourceLimitation-badrun2.txt, 
> RestartSlaveRequireExecutorAuthentication-badrun.txt, 
> TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> SlaveTest.RestartSlaveRequireExecutorAuthentication
> LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-09 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318567#comment-16318567
 ] 

Andrei Budnik commented on MESOS-7742:
--

How to reproduce Flavour 3:
Put a {{::sleep(1);}} before {{writer.close();}} in 
[Http::_attachContainerInput()|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/slave/http.cpp#L3222].

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>  Labels: flaky-test, mesosphere-oncall
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes at least in three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-09 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318666#comment-16318666
 ] 

Andrei Budnik commented on MESOS-7742:
--

Since we launched the 
[`cat`|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/tests/api_tests.cpp#L6529]
 command as a nested container, the related ioswitchboard process ends up in the 
same process group. Whenever the process group leader ({{cat}}) terminates, all 
processes in the process group are killed, including ioswitchboard.
ioswitchboard handles HTTP requests from the slave, e.g. the 
{{ATTACH_CONTAINER_INPUT}} request in this test.
Usually, after reading all of the client's data, {{Http::_attachContainerInput()}} 
invokes a callback which calls 
[writer.close()|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/slave/http.cpp#L3223].
[writer.close()|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L561]
 implies sending a 
[\r\n\r\n|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L1045]
 to the ioswitchboard process.
ioswitchboard returns a [200 
OK|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/slave/containerizer/mesos/io/switchboard.cpp#L1572]
 response, hence the agent returns {{200 OK}} for the {{ATTACH_CONTAINER_INPUT}} 
request as expected.

However, if ioswitchboard terminates before it receives the {{\r\n\r\n}}, or 
before the agent receives the {{200 OK}} response from ioswitchboard, the 
connection (over the unix socket) might be closed, so the corresponding 
{{ConnectionProcess}} will handle this case as an unexpected 
[EOF|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L1293]
 during the 
[read|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L1216]
 of the response. That leads to a {{500 Internal Server Error}} response from 
the agent.
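
A minimal sketch of the generic shape of this race, using plain POSIX sockets 
instead of libprocess ({{readResponse}} and {{ReadResult}} are hypothetical 
names, not Mesos APIs): the client keeps reading after sending the final 
{{\r\n\r\n}}, and a premature EOF instead of a complete response is the case 
that the agent maps to {{500 Internal Server Error}}:
{code}
// Hypothetical sketch (plain POSIX sockets, not libprocess): after the
// client sends the final "\r\n\r\n", it keeps reading the peer's HTTP
// response. If the peer exits before answering, read() returns 0 (EOF)
// before a complete response was seen -- the generic shape of the race
// described above.
#include <string>
#include <unistd.h>

enum class ReadResult { Ok, PrematureEof, IoError };

ReadResult readResponse(int fd, std::string* response)
{
  char buffer[4096];
  while (true) {
    ssize_t n = ::read(fd, buffer, sizeof(buffer));
    if (n < 0) {
      return ReadResult::IoError;        // read(2) failed outright.
    }
    if (n == 0) {
      // Peer closed the connection. If we have not yet seen the end of
      // the response headers, this is the "unexpected EOF" case.
      return response->find("\r\n\r\n") == std::string::npos
        ? ReadResult::PrematureEof
        : ReadResult::Ok;
    }
    response->append(buffer, n);
    if (response->find("\r\n\r\n") != std::string::npos) {
      return ReadResult::Ok;             // Status line and headers complete.
    }
  }
}
{code}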

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>  Labels: flaky-test, mesosphere-oncall
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes at least in three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7028) NetSocketTest.EOFBeforeRecv is flaky

2018-01-15 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7028:
-
Attachment: NetSocketTest.EOFBeforeRecv-vlog3.txt

> NetSocketTest.EOFBeforeRecv is flaky
> 
>
> Key: MESOS-7028
> URL: https://issues.apache.org/jira/browse/MESOS-7028
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
> Environment: ASF CI, autotools, gcc, CentOS 7, libevent/SSL enabled;
> Mac OS with SSL enabled;
> CentOS 6 with SSL enabled;
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Major
>  Labels: flaky, flaky-test, libprocess, mesosphere, socket, ssl
> Attachments: NetSocketTest.EOFBeforeRecv-vlog3.txt
>
>
> This was observed on ASF CI:
> {code}
> [ RUN  ] Encryption/NetSocketTest.EOFBeforeRecv/0
> I0128 03:48:51.444228 27745 openssl.cpp:419] CA file path is unspecified! 
> NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0128 03:48:51.444252 27745 openssl.cpp:424] CA directory path unspecified! 
> NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0128 03:48:51.444257 27745 openssl.cpp:429] Will not verify peer certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0128 03:48:51.444262 27745 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> I0128 03:48:51.447341 27745 process.cpp:1246] libprocess is initialized on 
> 172.17.0.2:45515 with 16 worker threads
> ../../../3rdparty/libprocess/src/tests/socket_tests.cpp:196: Failure
> Failed to wait 15secs for client->recv()
> [  FAILED  ] Encryption/NetSocketTest.EOFBeforeRecv/0, where GetParam() = 
> "SSL" (15269 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7028) NetSocketTest.EOFBeforeRecv is flaky

2018-01-15 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326364#comment-16326364
 ] 

Andrei Budnik commented on MESOS-7028:
--

Steps to reproduce:
1) Add a {{::sleep(1);}} after 
[server_socket.shutdown()|https://github.com/apache/mesos/blob/4959887230a7d7c55629083be978810f48b780a3/3rdparty/libprocess/src/tests/socket_tests.cpp#L195].
2) Recompile with `make check`.
3) Launch the test:
{code:java}
GLOG_v=3 sudo GLOG_v=3 ./3rdparty/libprocess/libprocess-tests 
--gtest_filter=Encryption/NetSocketTest.EOFBeforeRecv/0 
--gtest_break_on_failure --gtest_repeat=1 --verbose
{code}

> NetSocketTest.EOFBeforeRecv is flaky
> 
>
> Key: MESOS-7028
> URL: https://issues.apache.org/jira/browse/MESOS-7028
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
> Environment: ASF CI, autotools, gcc, CentOS 7, libevent/SSL enabled;
> Mac OS with SSL enabled;
> CentOS 6 with SSL enabled;
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Major
>  Labels: flaky, flaky-test, libprocess, mesosphere, socket, ssl
> Attachments: NetSocketTest.EOFBeforeRecv-vlog3.txt
>
>
> This was observed on ASF CI:
> {code}
> [ RUN  ] Encryption/NetSocketTest.EOFBeforeRecv/0
> I0128 03:48:51.444228 27745 openssl.cpp:419] CA file path is unspecified! 
> NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0128 03:48:51.444252 27745 openssl.cpp:424] CA directory path unspecified! 
> NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0128 03:48:51.444257 27745 openssl.cpp:429] Will not verify peer certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0128 03:48:51.444262 27745 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> I0128 03:48:51.447341 27745 process.cpp:1246] libprocess is initialized on 
> 172.17.0.2:45515 with 16 worker threads
> ../../../3rdparty/libprocess/src/tests/socket_tests.cpp:196: Failure
> Failed to wait 15secs for client->recv()
> [  FAILED  ] Encryption/NetSocketTest.EOFBeforeRecv/0, where GetParam() = 
> "SSL" (15269 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-17 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329006#comment-16329006
 ] 

Andrei Budnik commented on MESOS-7742:
--

These patches ^^ fix the first cause described in the [first 
patch|https://reviews.apache.org/r/65122/].

There is a second cause, where an attempt to connect to the IO switchboard 
fails with:
{code:java}
I1109 23:47:25.016929 27803 process.cpp:3982] Failed to process request for 
'/slave(812)/api/v1': Failed to connect to 
/tmp/mesos-io-switchboard-56bcba4b-6e81-4aeb-a0e9-41309ec991b5: Connection 
refused
W1109 23:47:25.017009 27803 http.cpp:2944] Failed to attach to nested container 
7ab572dd-78b5-4186-93af-7ac011990f80.b77944da-f1d5-4694-a51b-8fde150c5f7a: 
Failed to connect to 
/tmp/mesos-io-switchboard-56bcba4b-6e81-4aeb-a0e9-41309ec991b5: Connection 
refused
I1109 23:47:25.017063 27803 process.cpp:1590] Returning '500 Internal Server 
Error' for '/slave(812)/api/v1' (Failed to connect to 
/tmp/mesos-io-switchboard-56bcba4b-6e81-4aeb-a0e9-41309ec991b5: Connection 
refused)
{code}
The reason for this failure needs to be investigated.

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.5.0
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: flaky-test, mesosphere-oncall
> Fix For: 1.6.0
>
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes at least in three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-18 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16330630#comment-16330630
 ] 

Andrei Budnik commented on MESOS-7742:
--

Steps to reproduce the second cause:
1. Add a {{::sleep(2);}} after [binding the unix 
socket|https://github.com/apache/mesos/blob/634c8af2618c57a1405d20717fa909b399486f37/src/slave/containerizer/mesos/io/switchboard.cpp#L1056].
2. Recompile with `make && make check`.
3. Launch the test:
{code}
GLOG_v=2 sudo GLOG_v=2 ./src/mesos-tests 
--gtest_filter=ContentType/AgentAPITest.LaunchNestedContainerSession/0 
--gtest_break_on_failure --gtest_repeat=1 --verbose
{code}
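
One plausible reading of why the injected delay matters: a unix socket that has 
been bound but is not yet listening refuses connections, which matches the 
"Connection refused" error above. A minimal standalone sketch (plain POSIX 
sockets, unrelated to the switchboard code) that reproduces the same 
{{ECONNREFUSED}}:
{code}
// Standalone demo (not Mesos code): bind a unix socket but do not call
// listen(); a client connect() then fails with ECONNREFUSED, which is
// exactly the window that the ::sleep(2) above widens.
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main()
{
  const char* path = "/tmp/not-yet-listening.sock";
  ::unlink(path);

  sockaddr_un addr{};
  addr.sun_family = AF_UNIX;
  std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

  int server = ::socket(AF_UNIX, SOCK_STREAM, 0);
  ::bind(server, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  // NOTE: listen() is intentionally not called yet.

  int client = ::socket(AF_UNIX, SOCK_STREAM, 0);
  if (::connect(client, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
    std::printf("connect failed: %s\n", std::strerror(errno));  // ECONNREFUSED
  }

  ::close(client);
  ::close(server);
  ::unlink(path);
  return 0;
}
{code}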


> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.5.0
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: flaky-test, mesosphere-oncall
> Fix For: 1.6.0
>
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes at least in three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-7028) NetSocketTest.EOFBeforeRecv is flaky

2018-01-16 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-7028:


Assignee: Andrei Budnik  (was: Greg Mann)

> NetSocketTest.EOFBeforeRecv is flaky
> 
>
> Key: MESOS-7028
> URL: https://issues.apache.org/jira/browse/MESOS-7028
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
> Environment: ASF CI, autotools, gcc, CentOS 7, libevent/SSL enabled;
> Mac OS with SSL enabled;
> CentOS 6 with SSL enabled;
>Reporter: Greg Mann
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: flaky, flaky-test, libprocess, mesosphere, socket, ssl
> Attachments: NetSocketTest.EOFBeforeRecv-vlog3.txt
>
>
> This was observed on ASF CI:
> {code}
> [ RUN  ] Encryption/NetSocketTest.EOFBeforeRecv/0
> I0128 03:48:51.444228 27745 openssl.cpp:419] CA file path is unspecified! 
> NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0128 03:48:51.444252 27745 openssl.cpp:424] CA directory path unspecified! 
> NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0128 03:48:51.444257 27745 openssl.cpp:429] Will not verify peer certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0128 03:48:51.444262 27745 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> I0128 03:48:51.447341 27745 process.cpp:1246] libprocess is initialized on 
> 172.17.0.2:45515 with 16 worker threads
> ../../../3rdparty/libprocess/src/tests/socket_tests.cpp:196: Failure
> Failed to wait 15secs for client->recv()
> [  FAILED  ] Encryption/NetSocketTest.EOFBeforeRecv/0, where GetParam() = 
> "SSL" (15269 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-7028) NetSocketTest.EOFBeforeRecv is flaky

2018-01-16 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7028:
-
  Story Points: 5
Remaining Estimate: (was: 5m)
 Original Estimate: (was: 5m)

> NetSocketTest.EOFBeforeRecv is flaky
> 
>
> Key: MESOS-7028
> URL: https://issues.apache.org/jira/browse/MESOS-7028
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
> Environment: ASF CI, autotools, gcc, CentOS 7, libevent/SSL enabled;
> Mac OS with SSL enabled;
> CentOS 6 with SSL enabled;
>Reporter: Greg Mann
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: flaky, flaky-test, libprocess, mesosphere, socket, ssl
> Attachments: NetSocketTest.EOFBeforeRecv-vlog3.txt
>
>
> This was observed on ASF CI:
> {code}
> [ RUN  ] Encryption/NetSocketTest.EOFBeforeRecv/0
> I0128 03:48:51.444228 27745 openssl.cpp:419] CA file path is unspecified! 
> NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0128 03:48:51.444252 27745 openssl.cpp:424] CA directory path unspecified! 
> NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0128 03:48:51.444257 27745 openssl.cpp:429] Will not verify peer certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0128 03:48:51.444262 27745 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> I0128 03:48:51.447341 27745 process.cpp:1246] libprocess is initialized on 
> 172.17.0.2:45515 with 16 worker threads
> ../../../3rdparty/libprocess/src/tests/socket_tests.cpp:196: Failure
> Failed to wait 15secs for client->recv()
> [  FAILED  ] Encryption/NetSocketTest.EOFBeforeRecv/0, where GetParam() = 
> "SSL" (15269 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-7028) NetSocketTest.EOFBeforeRecv is flaky

2018-01-16 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7028:
-
Sprint: Mesosphere Sprint 72
Remaining Estimate: 5m
 Original Estimate: 5m

> NetSocketTest.EOFBeforeRecv is flaky
> 
>
> Key: MESOS-7028
> URL: https://issues.apache.org/jira/browse/MESOS-7028
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
> Environment: ASF CI, autotools, gcc, CentOS 7, libevent/SSL enabled;
> Mac OS with SSL enabled;
> CentOS 6 with SSL enabled;
>Reporter: Greg Mann
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: flaky, flaky-test, libprocess, mesosphere, socket, ssl
> Attachments: NetSocketTest.EOFBeforeRecv-vlog3.txt
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> This was observed on ASF CI:
> {code}
> [ RUN  ] Encryption/NetSocketTest.EOFBeforeRecv/0
> I0128 03:48:51.444228 27745 openssl.cpp:419] CA file path is unspecified! 
> NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0128 03:48:51.444252 27745 openssl.cpp:424] CA directory path unspecified! 
> NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0128 03:48:51.444257 27745 openssl.cpp:429] Will not verify peer certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0128 03:48:51.444262 27745 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> I0128 03:48:51.447341 27745 process.cpp:1246] libprocess is initialized on 
> 172.17.0.2:45515 with 16 worker threads
> ../../../3rdparty/libprocess/src/tests/socket_tests.cpp:196: Failure
> Failed to wait 15secs for client->recv()
> [  FAILED  ] Encryption/NetSocketTest.EOFBeforeRecv/0, where GetParam() = 
> "SSL" (15269 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

