[jira] [Created] (MESOS-9193) Mesos build fails with Clang 3.5

2018-08-29 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-9193:
--

 Summary: Mesos build fails with Clang 3.5
 Key: MESOS-9193
 URL: https://issues.apache.org/jira/browse/MESOS-9193
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.7.0
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao


1. The `-Wno-inconsistent-missing-override` option added in 
https://reviews.apache.org/r/67953/
is not recognized by clang 3.5.
2. The same issue described in https://reviews.apache.org/r/55400/ would make
`src/resource_provider/storage/provider.cpp` fail to compile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9189) Include 'Connection: close' header in streaming API responses.

2018-08-29 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596923#comment-16596923
 ] 

Benjamin Mahler commented on MESOS-9189:


{quote}
'Connection: close' set by a server is an indicator for the client to close the 
connection after receiving the complete response
{quote}

Perhaps this is the root of the confusion, as that's not quite what it means. 
See the following from RFC 7230 section 6.1 ("sender" instead of "client" is 
intentional here; in the case of this ticket, the "sender" is the server):

{quote}
   The "close" connection option is defined for a sender to signal that
   this connection will be closed after completion of the response.  For
   example,

 Connection: close

   in either the request or the response header fields indicates that
   the sender is going to close the connection after the current
   request/response is complete (Section 6.6).
{quote}

In RFC 7230 section 6.6:

{quote}
A server that sends a "close" connection option MUST initiate a close
of the connection (see below) after it sends the response containing
"close".  The server MUST NOT process any further requests received
on that connection.

A client that receives a "close" connection option MUST cease sending
requests on that connection and close the connection after reading
the response message containing the "close"; if additional pipelined
requests had been sent on the connection, the client SHOULD NOT
assume that they will be processed by the server.
{quote}

Let's ignore pipelining for now in this discussion as most intermediaries avoid 
it as far as I'm aware.

{quote}
Is the hope here that some middlemen peek into the 'Connection' header and 
based on it decide whether to close the connection themselves when their client 
disconnects even though the response might not be completed?
{quote}

Yes, in this ticket we're assuming the intermediary does not try to read the 
full response before forwarding it. If it did, the streaming API would not work 
at all through such an intermediary since the response is infinite. In 
addition, intermediaries MUST perform certain actions with the headers before 
forwarding the response (e.g. the spec requires intermediaries to process the 
'Connection' header before forwarding; see RFC 7230 section 6.1).

In terms of how sending 'Connection: close' will help: an intermediary that 
sees this knows that the server MUST initiate a close of the connection upon 
finishing sending the response. If the intermediary was planning to re-use the 
connection, it knows that it cannot once it sees the header (because the 
server MUST close it, and the intermediary MUST cease sending requests on that 
connection). If it knows this and it sees the end-client disconnect, its best 
choice is to close the connection to the server at that point in time.

> Include 'Connection: close' header in streaming API responses.
> --
>
> Key: MESOS-9189
> URL: https://issues.apache.org/jira/browse/MESOS-9189
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>
> We've seen some HTTP intermediaries (e.g. ELB) decide to re-use connections 
> to mesos as an optimization to avoid re-connection overhead. As a result, 
> when the end-client of the streaming API disconnects from the intermediary, 
> the intermediary leaves the connection to mesos open in an attempt to re-use 
> the connection for another request once the response completes. Mesos then 
> thinks that the subscriber never disconnected and the intermediary happily 
> continues to read the streaming events even though there's no end-client.
> To help indicate to intermediaries that the connection SHOULD NOT be re-used, 
> we can set the 'Connection: close' header for streaming API responses. It may 
> not be respected (since the language seems to be SHOULD NOT), but some 
> intermediaries may respect it and close the connection if the end-client 
> disconnects.
> Note that libprocess' http server currently doesn't close the connection 
> based on a handler setting this header, but it doesn't matter here since the 
> streaming API responses are infinite.





[jira] [Assigned] (MESOS-9191) Docker command executor may get stuck in an infinite unkillable loop.

2018-08-29 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-9191:
-

Shepherd: Qian Zhang
Assignee: Andrei Budnik
  Sprint: Mesosphere Sprint 2018-28

[~abudnik] Would you have cycles in the next sprint to work on this?

> Docker command executor may get stuck in an infinite unkillable loop.
> --
>
> Key: MESOS-9191
> URL: https://issues.apache.org/jira/browse/MESOS-9191
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Reporter: Gilbert Song
>Assignee: Andrei Budnik
>Priority: Blocker
>  Labels: containerizer
>
> Due to the change from https://issues.apache.org/jira/browse/MESOS-8574, the 
> behavior of the docker command executor when discarding the future of docker 
> stop was changed. If a new killTask() is invoked while an existing docker 
> stop is pending, the old future is discarded and the new stop is executed. 
> This is OK for most cases.
> However, docker stop could take long (depending on the grace period and 
> whether the application handles SIGTERM). If the framework retries killTask 
> more frequently than the grace period (which depends on the kill policy API, 
> env var, or agent flags), then the executor may be stuck forever with 
> unkillable tasks, because every time before the docker stop finishes, its 
> future is discarded by the new incoming killTask.
> We should consider re-using the grace period before calling discard() on a 
> pending docker stop future.





[jira] [Commented] (MESOS-9192) Mesos build fails on Ubuntu 14.04.

2018-08-29 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596866#comment-16596866
 ] 

James Peach commented on MESOS-9192:


Per [the docs|http://mesos.apache.org/documentation/latest/building/] we 
require clang >= 3.5. Maybe we ought to add a version check to the build like 
we did for GCC?

> Mesos build fails on Ubuntu 14.04.
> -
>
> Key: MESOS-9192
> URL: https://issues.apache.org/jira/browse/MESOS-9192
> Project: Mesos
>  Issue Type: Bug
>Reporter: Meng Zhu
>Priority: Major
>
> Ubuntu 14.04, clang 3.4
> If I manually install protobuf-compiler, the build will pass.
> {noformat}
> make[3]: Entering directory 
> `/home/mengzhu/workspace/mesos_current/build/3rdparty'
> cd grpc-1.10.0 &&   \
>   
> CPPFLAGS="-I/home/mengzhu/workspace/mesos_current/build/3rdparty/protobuf-3.5.0/src
>\
> \
> \
> -Wno-array-bounds   \
> -I/usr/include/subversion-1 -I/usr/include/apr-1 
> -I/usr/include/apr-1.0   " \
>   CFLAGS="-g1 -O0"  \
>   CXXFLAGS="-g1 -O0 -Wno-inconsistent-missing-override -std=c++11"
>   \
>   make  \
> 
> /home/mengzhu/workspace/mesos_current/build/3rdparty/grpc-1.10.0/libs/opt/libgrpc++_unsecure.a
>  
> /home/mengzhu/workspace/mesos_current/build/3rdparty/grpc-1.10.0/libs/opt/libgrpc_unsecure.a
>  
> /home/mengzhu/workspace/mesos_current/build/3rdparty/grpc-1.10.0/libs/opt/libgpr.a
> \
> CC="clang"  \
> CXX="clang++"   \
> LD="clang"  \
> LDXX="clang++"  \
> 
> LDFLAGS="-L/home/mengzhu/workspace/mesos_current/build/3rdparty/protobuf-3.5.0/src/.libs
> \
> \
> \
>  "  \
> LDLIBS=""   \
> HAS_PKG_CONFIG=false\
> NO_PROTOC=false \
> 
> PROTOC="/home/mengzhu/workspace/mesos_current/build/3rdparty/protobuf-3.5.0/src/protoc"
> make[4]: Entering directory 
> `/home/mengzhu/workspace/mesos_current/build/3rdparty/grpc-1.10.0'
> DEPENDENCY ERROR
> The target you are trying to run requires protobuf 3.0.0+
> Your system doesn't have it, and neither does the third_party directory.
> Please consult INSTALL to get more information.
> If you need information about why these tests failed, run:
>   make run_dep_checks
> make[4]: *** [stop] Error 1
> {noformat}





[jira] [Created] (MESOS-9192) Mesos build fails on Ubuntu 14.04.

2018-08-29 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9192:
---

 Summary: Mesos build fails on Ubuntu 14.04.
 Key: MESOS-9192
 URL: https://issues.apache.org/jira/browse/MESOS-9192
 Project: Mesos
  Issue Type: Bug
Reporter: Meng Zhu


Ubuntu 14.04, clang 3.4
If I manually install protobuf-compiler, the build will pass.

{noformat}
make[3]: Entering directory 
`/home/mengzhu/workspace/mesos_current/build/3rdparty'
cd grpc-1.10.0 &&   \
  
CPPFLAGS="-I/home/mengzhu/workspace/mesos_current/build/3rdparty/protobuf-3.5.0/src
   \
\
\
-Wno-array-bounds   \
-I/usr/include/subversion-1 -I/usr/include/apr-1 
-I/usr/include/apr-1.0   " \
  CFLAGS="-g1 -O0"  \
  CXXFLAGS="-g1 -O0 -Wno-inconsistent-missing-override -std=c++11"  
\
  make  \

/home/mengzhu/workspace/mesos_current/build/3rdparty/grpc-1.10.0/libs/opt/libgrpc++_unsecure.a
 
/home/mengzhu/workspace/mesos_current/build/3rdparty/grpc-1.10.0/libs/opt/libgrpc_unsecure.a
 
/home/mengzhu/workspace/mesos_current/build/3rdparty/grpc-1.10.0/libs/opt/libgpr.a
\
CC="clang"  \
CXX="clang++"   \
LD="clang"  \
LDXX="clang++"  \

LDFLAGS="-L/home/mengzhu/workspace/mesos_current/build/3rdparty/protobuf-3.5.0/src/.libs
\
\
\
 "  \
LDLIBS=""   \
HAS_PKG_CONFIG=false\
NO_PROTOC=false \

PROTOC="/home/mengzhu/workspace/mesos_current/build/3rdparty/protobuf-3.5.0/src/protoc"
make[4]: Entering directory 
`/home/mengzhu/workspace/mesos_current/build/3rdparty/grpc-1.10.0'

DEPENDENCY ERROR

The target you are trying to run requires protobuf 3.0.0+
Your system doesn't have it, and neither does the third_party directory.

Please consult INSTALL to get more information.

If you need information about why these tests failed, run:

  make run_dep_checks

make[4]: *** [stop] Error 1
{noformat}






[jira] [Comment Edited] (MESOS-7076) libprocess tests fail when using libevent 2.1.8

2018-08-29 Thread Till Toenshoff (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596762#comment-16596762
 ] 

Till Toenshoff edited comment on MESOS-7076 at 8/29/18 7:51 PM:


[~arojas] yes I saw that. My current plan is as follows:

1. Integrate version checks into our build systems that make sure we don't 
run into known incompatible version combinations.
2. Bundle the non-problematic libevent-2.0.22 and libssl-1.0.2p with Mesos 
(libprocess) to make sure "it just works™", while (1) still secures unbundled 
builds.
3. Dive deep again into actual debugging while fully involving the libevent 
mailing list.

Generally speaking, (1) and (2) are commonly a good idea in my opinion, as long 
as unbundled builds are an option.


was (Author: tillt):
[~arojas] yes I saw that. My current plan is as follows;

1st; integrate version checks into our build systems that make sure we don't 
run into known incompatible version combinations
2nd; bundle non problematic libevent-2.0.22 and libssl-1.0.2g with Mesos 
(libprocess) to make sure "it just works™" while (1) still secures unbundled 
builds
3rd; dive deep again into actual debugging while totally involving the libevent 
mailing list

Generally speaking, (1) and (2) are commonly a good idea in my opinion as long 
as unbundled builds are an option.

> libprocess tests fail when using libevent 2.1.8
> ---
>
> Key: MESOS-7076
> URL: https://issues.apache.org/jira/browse/MESOS-7076
> Project: Mesos
>  Issue Type: Bug
>  Components: build, libprocess, test
> Environment: macOS 10.12.3, libevent 2.1.8 (installed via Homebrew)
>Reporter: Jan Schlicht
>Assignee: Till Toenshoff
>Priority: Critical
>  Labels: ci
>
> Running {{libprocess-tests}} on Mesos compiled with {{--enable-libevent 
> --enable-ssl}} on an operating system using libevent 2.1.8, SSL-related tests 
> fail like
> {noformat}
> [ RUN  ] SSLTest.SSLSocket
> I0207 15:20:46.017881 2528580544 openssl.cpp:419] CA file path is 
> unspecified! NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0207 15:20:46.017904 2528580544 openssl.cpp:424] CA directory path 
> unspecified! NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0207 15:20:46.017918 2528580544 openssl.cpp:429] Will not verify peer 
> certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0207 15:20:46.017923 2528580544 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0207 15:20:46.033001 2528580544 openssl.cpp:419] CA file path is 
> unspecified! NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0207 15:20:46.033179 2528580544 openssl.cpp:424] CA directory path 
> unspecified! NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0207 15:20:46.033196 2528580544 openssl.cpp:429] Will not verify peer 
> certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0207 15:20:46.033201 2528580544 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> ../../../3rdparty/libprocess/src/tests/ssl_tests.cpp:257: Failure
> Failed to wait 15secs for Socket(socket.get()).recv()
> [  FAILED  ] SSLTest.SSLSocket (15196 ms)
> {noformat}
> Tests failing are
> {noformat}
> SSLTest.SSLSocket
> SSLTest.NoVerifyBadCA
> SSLTest.VerifyCertificate
> SSLTest.ProtocolMismatch
> SSLTest.ECDHESupport
> SSLTest.PeerAddress
> SSLTest.HTTPSGet
> SSLTest.HTTPSPost
> SSLTest.SilentSocket
> SSLTest.ShutdownThenSend
> SSLVerifyIPAdd/SSLTest.BasicSameProcess/0, where GetParam() = "false"
> SSLVerifyIPAdd/SSLTest.BasicSameProcess/1, where GetParam() = "true"
> SSLVerifyIPAdd/SSLTest.BasicSameProcessUnix/0, where GetParam() = "false"
> SSLVerifyIPAdd/SSLTest.BasicSameProcessUnix/1, where GetParam() = "true"
> SSLVerifyIPAdd/SSLTest.RequireCertificate/0, where GetParam() = "false"
> SSLVerifyIPAdd/SSLTest.RequireCertificate/1, where GetParam() = "true"
> {noformat}





[jira] [Commented] (MESOS-7076) libprocess tests fail when using libevent 2.1.8

2018-08-29 Thread Till Toenshoff (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596762#comment-16596762
 ] 

Till Toenshoff commented on MESOS-7076:
---

[~arojas] yes I saw that. My current plan is as follows:

1. Integrate version checks into our build systems that make sure we don't 
run into known incompatible version combinations.
2. Bundle the non-problematic libevent-2.0.22 and libssl-1.0.2g with Mesos 
(libprocess) to make sure "it just works™", while (1) still secures unbundled 
builds.
3. Dive deep again into actual debugging while fully involving the libevent 
mailing list.

Generally speaking, (1) and (2) are commonly a good idea in my opinion, as long 
as unbundled builds are an option.

> libprocess tests fail when using libevent 2.1.8
> ---
>
> Key: MESOS-7076
> URL: https://issues.apache.org/jira/browse/MESOS-7076
> Project: Mesos
>  Issue Type: Bug
>  Components: build, libprocess, test
> Environment: macOS 10.12.3, libevent 2.1.8 (installed via Homebrew)
>Reporter: Jan Schlicht
>Assignee: Till Toenshoff
>Priority: Critical
>  Labels: ci
>
> Running {{libprocess-tests}} on Mesos compiled with {{--enable-libevent 
> --enable-ssl}} on an operating system using libevent 2.1.8, SSL-related tests 
> fail like
> {noformat}
> [ RUN  ] SSLTest.SSLSocket
> I0207 15:20:46.017881 2528580544 openssl.cpp:419] CA file path is 
> unspecified! NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0207 15:20:46.017904 2528580544 openssl.cpp:424] CA directory path 
> unspecified! NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0207 15:20:46.017918 2528580544 openssl.cpp:429] Will not verify peer 
> certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0207 15:20:46.017923 2528580544 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0207 15:20:46.033001 2528580544 openssl.cpp:419] CA file path is 
> unspecified! NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0207 15:20:46.033179 2528580544 openssl.cpp:424] CA directory path 
> unspecified! NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0207 15:20:46.033196 2528580544 openssl.cpp:429] Will not verify peer 
> certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0207 15:20:46.033201 2528580544 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> ../../../3rdparty/libprocess/src/tests/ssl_tests.cpp:257: Failure
> Failed to wait 15secs for Socket(socket.get()).recv()
> [  FAILED  ] SSLTest.SSLSocket (15196 ms)
> {noformat}
> Tests failing are
> {noformat}
> SSLTest.SSLSocket
> SSLTest.NoVerifyBadCA
> SSLTest.VerifyCertificate
> SSLTest.ProtocolMismatch
> SSLTest.ECDHESupport
> SSLTest.PeerAddress
> SSLTest.HTTPSGet
> SSLTest.HTTPSPost
> SSLTest.SilentSocket
> SSLTest.ShutdownThenSend
> SSLVerifyIPAdd/SSLTest.BasicSameProcess/0, where GetParam() = "false"
> SSLVerifyIPAdd/SSLTest.BasicSameProcess/1, where GetParam() = "true"
> SSLVerifyIPAdd/SSLTest.BasicSameProcessUnix/0, where GetParam() = "false"
> SSLVerifyIPAdd/SSLTest.BasicSameProcessUnix/1, where GetParam() = "true"
> SSLVerifyIPAdd/SSLTest.RequireCertificate/0, where GetParam() = "false"
> SSLVerifyIPAdd/SSLTest.RequireCertificate/1, where GetParam() = "true"
> {noformat}





[jira] [Commented] (MESOS-9159) Support Foreign URLs in docker registry puller

2018-08-29 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596757#comment-16596757
 ] 

Jie Yu commented on MESOS-9159:
---

commit 3295fc98cf33bf22bb3d7b1d1ade424c477d3b83
Author: Liangyu Zhao 
Date:   Wed Aug 29 11:54:47 2018 -0700

Windows: Enabled `DockerFetcherPluginTest` suite.

Enabled `Internet` test environment on Windows. Disabled `Internet`
`HealthCheckTests` on Windows, since they require complete
development. Modified `DockerFetcherPluginTest` to fetch
`microsoft/nanoserver` for more extensive test for fetcher on Windows.

Review: https://reviews.apache.org/r/67930/

commit cdf8eab619239600f5105965b676b13887931f91
Author: Liangyu Zhao 
Date:   Wed Aug 29 11:54:25 2018 -0700

Windows: Enable DockerFetcher in Windows agent.

Review: https://reviews.apache.org/r/68455/

> Support Foreign URLs in docker registry puller
> --
>
> Key: MESOS-9159
> URL: https://issues.apache.org/jira/browse/MESOS-9159
> Project: Mesos
>  Issue Type: Task
>Reporter: Akash Gupta
>Assignee: Liangyu Zhao
>Priority: Major
>
> Currently, trying to pull the layers of a Windows image with the current 
> registry pull code will return a 404 error. This is because Windows docker 
> images need to pull the base OS layers from the foreign URLs field in the 
> version 2 schema 2 docker manifest. As a result, the registry puller needs 
> to be aware of version 2 schema 2 and the foreign URLs field.





[jira] [Comment Edited] (MESOS-8976) MasterTest.LaunchDuplicateOfferLost is flaky

2018-08-29 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596722#comment-16596722
 ] 

Joseph Wu edited comment on MESOS-8976 at 8/29/18 6:52 PM:
---

The {{src/test/utils.cpp:64}} helper that failed is:
{code}
JSON::Object Metrics()
{
  UPID upid("metrics", process::address());

  /* For some reason, this call times out and never completes. */
  Future<http::Response> response = http::get(upid, "snapshot");

  AWAIT_EXPECT_RESPONSE_STATUS_EQ(http::OK().status, response);
  AWAIT_EXPECT_RESPONSE_HEADER_EQ(APPLICATION_JSON, "Content-Type", response);

  /* The `response->body` below is basically an unguarded `Future::get` 
because the two AWAIT calls above are the EXPECT variety, meaning 
that they do not return from the function when they fail. */
  Try<JSON::Object> parse = JSON::parse<JSON::Object>(response->body);
  CHECK_SOME(parse);

  return parse.get();
}
{code}


was (Author: kaysoky):
The {{src/test/utils.cpp:64}} helper that failed is:
{code}
JSON::Object Metrics()
{
  UPID upid("metrics", process::address());

  /* For some reason, this call times out and never completes. */
  Future<http::Response> response = http::get(upid, "snapshot");

  AWAIT_EXPECT_RESPONSE_STATUS_EQ(http::OK().status, response);
  AWAIT_EXPECT_RESPONSE_HEADER_EQ(APPLICATION_JSON, "Content-Type", response);

  /* The `response->body` below is basically an unguarded `Future::get` because 
     the two AWAIT calls above are the EXPECT variety, meaning that they do not 
     return from the function when they fail. */
  Try<JSON::Object> parse = JSON::parse<JSON::Object>(response->body);
  CHECK_SOME(parse);

  return parse.get();
}
{code}

> MasterTest.LaunchDuplicateOfferLost is flaky
> 
>
> Key: MESOS-8976
> URL: https://issues.apache.org/jira/browse/MESOS-8976
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: flaky-test
> Attachments: LaunchDuplicateOfferLost.jenkins-faillog
>
>
> In an internal CI run, we observed a failure with this test where the 
> scheduler seemed to be stuck repeatedly allocating resources to the agent for 
> about 1 hour before getting timed out. See attached log for details.





[jira] [Commented] (MESOS-8976) MasterTest.LaunchDuplicateOfferLost is flaky

2018-08-29 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596722#comment-16596722
 ] 

Joseph Wu commented on MESOS-8976:
--

The {{src/test/utils.cpp:64}} helper that failed is:
{code}
JSON::Object Metrics()
{
  UPID upid("metrics", process::address());

  /* For some reason, this call times out and never completes. */
  Future<http::Response> response = http::get(upid, "snapshot");

  AWAIT_EXPECT_RESPONSE_STATUS_EQ(http::OK().status, response);
  AWAIT_EXPECT_RESPONSE_HEADER_EQ(APPLICATION_JSON, "Content-Type", response);

  /* The `response->body` below is basically an unguarded `Future::get` because 
     the two AWAIT calls above are the EXPECT variety, meaning that they do not 
     return from the function when they fail. */
  Try<JSON::Object> parse = JSON::parse<JSON::Object>(response->body);
  CHECK_SOME(parse);

  return parse.get();
}
{code}

> MasterTest.LaunchDuplicateOfferLost is flaky
> 
>
> Key: MESOS-8976
> URL: https://issues.apache.org/jira/browse/MESOS-8976
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: flaky-test
> Attachments: LaunchDuplicateOfferLost.jenkins-faillog
>
>
> In an internal CI run, we observed a failure with this test where the 
> scheduler seemed to be stuck repeatedly allocating resources to the agent for 
> about 1 hour before getting timed out. See attached log for details.





[jira] [Commented] (MESOS-8976) MasterTest.LaunchDuplicateOfferLost is flaky

2018-08-29 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596700#comment-16596700
 ] 

Chun-Hung Hsiao commented on MESOS-8976:


This is caused by MESOS-6231. The following 
[code|https://github.com/apache/mesos/blob/959fa0bbe6dcde60262bc131f851f5bb2d709d57/src/tests/utils.cpp#L59-L67]
 is stuck because the {{/metrics/snapshot}} request has been pending for more than 1hr:
{code:cpp}
  // TODO(neilc): This request might timeout if the current value of a
  // metric cannot be determined. In tests, a common cause for this is
  // MESOS-6231 when multiple scheduler drivers are in use.
  Future<http::Response> response = http::get(upid, "snapshot");

  AWAIT_EXPECT_RESPONSE_STATUS_EQ(http::OK().status, response);
  AWAIT_EXPECT_RESPONSE_HEADER_EQ(APPLICATION_JSON, "Content-Type", response);

  Try<JSON::Object> parse = JSON::parse<JSON::Object>(response->body);
{code}

> MasterTest.LaunchDuplicateOfferLost is flaky
> 
>
> Key: MESOS-8976
> URL: https://issues.apache.org/jira/browse/MESOS-8976
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: flaky-test
> Attachments: LaunchDuplicateOfferLost.jenkins-faillog
>
>
> In an internal CI run, we observed a failure with this test where the 
> scheduler seemed to be stuck repeatedly allocating resources to the agent for 
> about 1 hour before getting timed out. See attached log for details.





[jira] [Commented] (MESOS-8770) Use Python3 for Mesos support scripts

2018-08-29 Thread Kevin Klues (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596502#comment-16596502
 ] 

Kevin Klues commented on MESOS-8770:


{noformat}
commit b55c5deb88b4a4e5713c9361e492e941972fe8db
Author: Kevin Klues 
Date:   Wed Aug 29 13:39:29 2018 +0200

Updated the python2 'PyLinter' to only lint python2 based code.

Specifically, this includes no longer linting all code under the
`src/python` directory.

Review: https://reviews.apache.org/r/68560
{noformat}

> Use Python3 for Mesos support scripts
> -
>
> Key: MESOS-8770
> URL: https://issues.apache.org/jira/browse/MESOS-8770
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Bannier
>Assignee: Armand Grillet
>Priority: Major
>
> Our Python scripts under {{support/}} currently implicitly assume that 
> developers have a python2 environment as their primary Python installation.
> We should consider updating these scripts so that they can be used with a 
> python3 installation as well. There exist [some 
> resources|http://python-future.org/overview.html#automatic-conversion-to-py2-3-compatible-code]
>  on the web documenting best practices and tools for automatic rewrites which 
> should get us a long way.





[jira] [Comment Edited] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.

2018-08-29 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596258#comment-16596258
 ] 

Alexander Rukletsov edited comment on MESOS-8545 at 8/29/18 3:01 PM:
-

When the agent handles {{ATTACH_CONTAINER_INPUT}} call, it creates an HTTP 
[streaming 
connection|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3104]
 to IOSwitchboard.
 After the agent 
[sends|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3141]
 a request to IOSwitchboard, a new instance of {{ConnectionProcess}} is 
created, which calls 
[{{ConnectionProcess::read()}}|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1220]
 to read an HTTP response from IOSwitchboard.
 If the socket is closed before a `\r\n\r\n` response is received, the 
{{ConnectionProcess}} calls 
`[disconnect()|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1326]`,
 which in turn [flushes 
`pipeline`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1197-L1201]
 containing a {{Response}} promise. This results in an {{HTTP 500}} error with 
body "Disconnected" being sent back to the 
{{AttachInputToNestedContainerSession}} 
[test|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/tests/api_tests.cpp#L7942-L7943].

When the io redirect finishes, IOSwitchboardServerProcess calls {{terminate(self(), 
false)}} (here 
[\[1\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1262]
 or there 
[\[2\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1713]).
 Then, {{IOSwitchboardServerProcess::finalize()}} sets a value to the 
[`promise`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1304-L1308],
 which [unblocks 
{{main()}}|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard_main.cpp#L149-L150].
 As a result, the IOSwitchboard process terminates immediately.

When IOSwitchboard terminates, there could be response messages not yet 
[written|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1699]
 to the socket. So, if any delay occurs before 
[sending|https://github.com/apache/mesos/blob/95bbe784da51b3a7eaeb9127e2541ea0b2af07b5/3rdparty/libprocess/src/http.cpp#L1742-L1748]
 the response back to the agent, the socket will be closed due to IOSwitchboard 
process termination. That leads to the aforementioned premature socket close in 
the agent.

See my previous comment which includes steps to reproduce the bug.


was (Author: abudnik):
When the agent handles `ATTACH_CONTAINER_INPUT` call, it creates an HTTP 
[streaming 
connection|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3104]
 to IOSwitchboard.
 After the agent 
[sends|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3141]
 a request to IOSwitchboard, a new instance of `ConnectionProcess` is created, 
which calls 
[`ConnectionProcess::read()`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1220]
 to read an HTTP response from IOSwitchboard.
 If the socket is closed before a `\r\n\r\n` response is received, the 
`ConnectionProcess` calls 
`[disconnect()|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1326]`,
 which in turn [flushes 
`pipeline`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1197-L1201]
 containing a `Response` promise. This leads to responding back (to the 
`AttachInputToNestedContainerSession` 
[test|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/tests/api_tests.cpp#L7942-L7943])
 an `HTTP 500` error with body "Disconnected".

When io redirect finishes, IOSwitchboardServerProcess calls `terminate(self(), 
false)` (here 
[\[1\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1262]
 or there 
[\[2\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1713]).
 Then, `IOSwitchboardServerProcess::finalize()` sets a value to the 

[jira] [Comment Edited] (MESOS-9162) Unkillable pod container stuck in ISOLATING

2018-08-29 Thread A. Dukhovniy (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596432#comment-16596432
 ] 

A. Dukhovniy edited comment on MESOS-9162 at 8/29/18 2:51 PM:
--

We saw another manifestation of the unkillable pod task, this time in a 
different test. All of these tests do the same thing:
1. start a pod
2. kill it (either on the agent directly or through Marathon)
3. restart the pod

The test *never fails in step 1* but *always in step 3*. Also, all tests use the 
same pod definition, *which uses a volume*.

I won't attach new logs since the symptoms (container stuck in ISOLATING) look 
the same.
/cc [~abudnik], [~gilbert], [~jieyu]


was (Author: zen-dog):
We had another manifestation of the unkillable pod task, however in a different 
test. There is one thing that those tests all do - 
1. they start  a pod
2. kill it (either on the agent directly or through the Marathon)
3. restart the pod

The test *never fails in step1* but *always in step 3*. Also, all tests use the 
same pod definition *which uses volume*.

I'll not attach new logs since they symptoms (container stuck in ISOLATING) 
look the same.
/cc [~abudnik] [~gilbert] [~jieyu]

> Unkillable pod container stuck in ISOLATING
> ---
>
> Key: MESOS-9162
> URL: https://issues.apache.org/jira/browse/MESOS-9162
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.6.0, 1.7.0
>Reporter: A. Dukhovniy
>Assignee: Gilbert Song
>Priority: Major
>  Labels: container-stuck
> Attachments: dcos-marathon.service.log, dcos-mesos-master.service.gz, 
> dcos-mesos-slave.service.gz, diagnostics.zip, 
> sandbox_10_10_0_222_var_lib.tar.gz
>
>
> We have a simple test that launches a pod with two containers (one writes to 
> a file and the other reads it). This test is flaky because the container 
> sometimes fails to start.
> Marathon app definition:
> {code:java}
> {
>   "id": "/simple-pod",
>   "scaling": {
> "kind": "fixed",
> "instances": 1
>   },
>   "environment": {
> "PING": "PONG"
>   },
>   "containers": [
> {
>   "name": "ct1",
>   "resources": {
> "cpus": 0.1,
> "mem": 32
>   },
>   "image": {
> "kind": "DOCKER",
> "id": "busybox"
>   },
>   "exec": {
> "command": {
>   "shell": "while true; do echo the current time is $(date) > 
> ./test-v1/clock; sleep 1; done"
> }
>   },
>   "volumeMounts": [
> {
>   "name": "v1",
>   "mountPath": "test-v1"
> }
>   ]
> },
> {
>   "name": "ct2",
>   "resources": {
> "cpus": 0.1,
> "mem": 32
>   },
>   "exec": {
> "command": {
>   "shell": "while true; do echo -n $PING ' '; cat ./etc/clock; sleep 
> 1; done"
> }
>   },
>   "volumeMounts": [
> {
>   "name": "v1",
>   "mountPath": "etc"
> },
> {
>   "name": "v2",
>   "mountPath": "docker"
> }
>   ]
> }
>   ],
>   "networks": [
> {
>   "mode": "host"
> }
>   ],
>   "volumes": [
> {
>   "name": "v1"
> },
> {
>   "name": "v2",
>   "host": "/var/lib/docker"
> }
>   ]
> }
> {code}
> During the test, Marathon tries to launch the pod but doesn't receive a 
> {{TASK_RUNNING}} for the first container, so after 2 minutes it decides to 
> kill the pod, which also fails.
> Agent sandbox (attached to this ticket minus docker layers, since they're too 
> big to attach) shows that one of the containers wasn't started properly - the 
> last line in the agent log says:
> {code}
> Transitioning the state of container 
> ff4f4fdc-9327-42fb-be40-29e919e15aee.e9b05652-e779-46f8-9b76-b2e1ce7e5940 
> from PREPARING to ISOLATING
> {code}
> Until then the log looks pretty unspectacular. 
> Afterwards, Marathon tries to kill the container repeatedly, but doesn't 
> succeed - the executor receives the requests but doesn't send anything back:
> {code}
> I0816 22:52:53.111995 4 default_executor.cpp:204] Received SUBSCRIBED 
> event
> I0816 22:52:53.112520 4 default_executor.cpp:208] Subscribed executor on 
> 10.10.0.222
> I0816 22:52:53.112783 4 default_executor.cpp:204] Received LAUNCH_GROUP 
> event
> I0816 22:52:53.11651611 default_executor.cpp:428] Setting 
> 'MESOS_CONTAINER_IP' to: 10.10.0.222
> I0816 22:52:53.169596 4 default_executor.cpp:204] Received ACKNOWLEDGED 
> event
> I0816 22:52:53.19441610 default_executor.cpp:204] Received ACKNOWLEDGED 
> event
> I0816 22:54:50.559470 8 default_executor.cpp:204] Received KILL event
> I0816 22:54:50.559496 8 default_executor.cpp:1251] Received kill for task 
> 

[jira] [Comment Edited] (MESOS-9162) Unkillable pod container stuck in ISOLATING

2018-08-29 Thread A. Dukhovniy (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596432#comment-16596432
 ] 

A. Dukhovniy edited comment on MESOS-9162 at 8/29/18 2:50 PM:
--

We saw another manifestation of the unkillable pod task, this time in a 
different test. All of these tests do the same thing:
1. start a pod
2. kill it (either on the agent directly or through Marathon)
3. restart the pod

The test *never fails in step 1* but *always in step 3*. Also, all tests use the 
same pod definition, *which uses a volume*.

I won't attach new logs since the symptoms (container stuck in ISOLATING) look 
the same.
/cc [~abudnik] [~gilbert] [~jieyu]


was (Author: zen-dog):
We had another manifestation of the unkillable pod task, however in a different 
test. There is one thing that those tests all do - 
1. they start  a pod
2. kill it (either on the agent directly or through the Marathon)
3. restart the pod

The test **never fails in step1** but *always in step 3*. Also, all tests use 
the same pod definition **which uses volume**.

I'll not attach new logs since they symptoms (container stuck in ISOLATING) 
look the same.
/cc [~abudnik] [~gilbert] [~jieyu]


[jira] [Commented] (MESOS-9162) Unkillable pod container stuck in ISOLATING

2018-08-29 Thread A. Dukhovniy (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596432#comment-16596432
 ] 

A. Dukhovniy commented on MESOS-9162:
-

We saw another manifestation of the unkillable pod task, this time in a 
different test. All of these tests do the same thing:
1. start a pod
2. kill it (either on the agent directly or through Marathon)
3. restart the pod

The test *never fails in step 1* but *always in step 3*. Also, all tests use the 
same pod definition, *which uses a volume*.

I won't attach new logs since the symptoms (container stuck in ISOLATING) look 
the same.
/cc [~abudnik] [~gilbert] [~jieyu]

> Unkillable pod container stuck in ISOLATING
> ---
>
> Key: MESOS-9162
> URL: https://issues.apache.org/jira/browse/MESOS-9162
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.6.0, 1.7.0
>Reporter: A. Dukhovniy
>Assignee: Gilbert Song
>Priority: Major
>  Labels: container-stuck
> Attachments: dcos-marathon.service.log, dcos-mesos-master.service.gz, 
> dcos-mesos-slave.service.gz, diagnostics.zip, 
> sandbox_10_10_0_222_var_lib.tar.gz
>
>
> We have a simple test that launches a pod with two containers (one writes to 
> a file and the other reads it). This test is flaky because the container 
> sometimes fails to start.
> Marathon app definition:
> {code:java}
> {
>   "id": "/simple-pod",
>   "scaling": {
> "kind": "fixed",
> "instances": 1
>   },
>   "environment": {
> "PING": "PONG"
>   },
>   "containers": [
> {
>   "name": "ct1",
>   "resources": {
> "cpus": 0.1,
> "mem": 32
>   },
>   "image": {
> "kind": "DOCKER",
> "id": "busybox"
>   },
>   "exec": {
> "command": {
>   "shell": "while true; do echo the current time is $(date) > 
> ./test-v1/clock; sleep 1; done"
> }
>   },
>   "volumeMounts": [
> {
>   "name": "v1",
>   "mountPath": "test-v1"
> }
>   ]
> },
> {
>   "name": "ct2",
>   "resources": {
> "cpus": 0.1,
> "mem": 32
>   },
>   "exec": {
> "command": {
>   "shell": "while true; do echo -n $PING ' '; cat ./etc/clock; sleep 
> 1; done"
> }
>   },
>   "volumeMounts": [
> {
>   "name": "v1",
>   "mountPath": "etc"
> },
> {
>   "name": "v2",
>   "mountPath": "docker"
> }
>   ]
> }
>   ],
>   "networks": [
> {
>   "mode": "host"
> }
>   ],
>   "volumes": [
> {
>   "name": "v1"
> },
> {
>   "name": "v2",
>   "host": "/var/lib/docker"
> }
>   ]
> }
> {code}
> During the test, Marathon tries to launch the pod but doesn't receive a 
> {{TASK_RUNNING}} for the first container, so after 2 minutes it decides to 
> kill the pod, which also fails.
> Agent sandbox (attached to this ticket minus docker layers, since they're too 
> big to attach) shows that one of the containers wasn't started properly - the 
> last line in the agent log says:
> {code}
> Transitioning the state of container 
> ff4f4fdc-9327-42fb-be40-29e919e15aee.e9b05652-e779-46f8-9b76-b2e1ce7e5940 
> from PREPARING to ISOLATING
> {code}
> Until then the log looks pretty unspectacular. 
> Afterwards, Marathon tries to kill the container repeatedly, but doesn't 
> succeed - the executor receives the requests but doesn't send anything back:
> {code}
> I0816 22:52:53.111995 4 default_executor.cpp:204] Received SUBSCRIBED 
> event
> I0816 22:52:53.112520 4 default_executor.cpp:208] Subscribed executor on 
> 10.10.0.222
> I0816 22:52:53.112783 4 default_executor.cpp:204] Received LAUNCH_GROUP 
> event
> I0816 22:52:53.11651611 default_executor.cpp:428] Setting 
> 'MESOS_CONTAINER_IP' to: 10.10.0.222
> I0816 22:52:53.169596 4 default_executor.cpp:204] Received ACKNOWLEDGED 
> event
> I0816 22:52:53.19441610 default_executor.cpp:204] Received ACKNOWLEDGED 
> event
> I0816 22:54:50.559470 8 default_executor.cpp:204] Received KILL event
> I0816 22:54:50.559496 8 default_executor.cpp:1251] Received kill for task 
> 'simple-pod-bcc8f180b611494aa972520b8b650ca9.instance-1ad9ecbb-a1a7-11e8-b35a-6e17842c13e2.ct1'
> I0816 22:54:50.559737 4 default_executor.cpp:204] Received KILL event
> I0816 22:54:50.559751 4 default_executor.cpp:1251] Received kill for task 
> 'simple-pod-bcc8f180b611494aa972520b8b650ca9.instance-1ad9ecbb-a1a7-11e8-b35a-6e17842c13e2.ct2'
> ...
> {code}
> Relevant Ids for grepping the logs:
> {code}
> Marathon app id: /simple-pod-bcc8f180b611494aa972520b8b650ca9
> Mesos tasks id: 
> 

[jira] [Commented] (MESOS-8972) when choose docker image use user network all mesos agent crash

2018-08-29 Thread Alan Silva (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596388#comment-16596388
 ] 

Alan Silva commented on MESOS-8972:
---

I'm having the same issue.

This feature is very important for one of our customers; is there an ETA on a 
fix?

I'm more than willing to build Mesos from source if a fix commit exists in one 
of the current branches.

> when choose docker image use user network all mesos agent crash
> ---
>
> Key: MESOS-8972
> URL: https://issues.apache.org/jira/browse/MESOS-8972
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.7.0
> Environment: Ubuntu 14.04 & Ubuntu 16.04, both type crashes mesos
>Reporter: saturnman
>Priority: Major
>  Labels: docker, network
>
> When submitting a Docker task from Marathon with the user network selected, 
> the Mesos agent crashes with the following backtrace:
> mesos-agent: .././../3rdparty/stout/include/stout/option.hpp:118: const T& 
> Option::get() const & [with T = std::__cxx11::basic_string]: 
> Assertion `isSome()' failed.
> *** Aborted at 1527797505 (unix time) try "date -d @1527797505" if you are 
> using GNU date ***
> PC: @ 0x7fc03d43f428 (unknown)
> *** SIGABRT (@0x4514) received by PID 17684 (TID 0x7fc033143700) from PID 
> 17684; stack trace: ***
>  @ 0x7fc03dd7d390 (unknown)
>  @ 0x7fc03d43f428 (unknown)
>  @ 0x7fc03d44102a (unknown)
>  @ 0x7fc03d437bd7 (unknown)
>  @ 0x7fc03d437c82 (unknown)
>  @ 0x564f1ad8871d 
> _ZNKR6OptionINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIc3getEv
>  @ 0x7fc048c43256 
> mesos::internal::slave::NetworkCniIsolatorProcess::getNetworkConfigJSON()
>  @ 0x7fc048c368cb mesos::internal::slave::NetworkCniIsolatorProcess::prepare()
>  @ 0x7fc0486e5c18 
> _ZZN7process8dispatchI6OptionIN5mesos5slave19ContainerLaunchInfoEENS2_8internal5slave20MesosIsolatorProcessERKNS2_11ContainerIDERKNS3_15ContainerConfigESB_SE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSJ_FSH_T1_T2_EOT3_OT4_ENKUlSt10unique_ptrINS_7PromiseIS5_EESt14default_deleteISX_EEOS9_OSC_PNS_11ProcessBaseEE_clES10_S11_S12_S14_



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.

2018-08-29 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596258#comment-16596258
 ] 

Andrei Budnik edited comment on MESOS-8545 at 8/29/18 12:42 PM:


When the agent handles the `ATTACH_CONTAINER_INPUT` call, it creates an HTTP 
[streaming 
connection|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3104]
 to IOSwitchboard.
 After the agent 
[sends|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3141]
 a request to IOSwitchboard, a new instance of `ConnectionProcess` is created, 
which calls 
[`ConnectionProcess::read()`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1220]
 to read an HTTP response from IOSwitchboard.
 If the socket is closed before a `\r\n\r\n` response is received, the 
`ConnectionProcess` calls 
`[disconnect()|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1326]`,
 which in turn [flushes 
`pipeline`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1197-L1201]
 containing a `Response` promise. This results in an `HTTP 500` error with body 
"Disconnected" being returned to the `AttachInputToNestedContainerSession` 
[test|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/tests/api_tests.cpp#L7942-L7943].

When I/O redirection finishes, IOSwitchboardServerProcess calls `terminate(self(), 
false)` (here 
[[1]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1262]
 or there 
[[2]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1713]).
 Then, `IOSwitchboardServerProcess::finalize()` sets a value on the 
[`promise`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1304-L1308],
 which [unblocks the `main()` function|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard_main.cpp#L149-L150].
 As a result, the IOSwitchboard process terminates immediately.

When IOSwitchboard terminates, some response messages may not yet have been 
[written|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1699]
 to the socket. So if any delay occurs before the response is 
[sent|https://github.com/apache/mesos/blob/95bbe784da51b3a7eaeb9127e2541ea0b2af07b5/3rdparty/libprocess/src/http.cpp#L1742-L1748]
 back to the agent, the socket is closed by the terminating IOSwitchboard 
process. That leads to the aforementioned premature socket close in the agent.

See my previous comment which includes steps to reproduce the bug.


was (Author: abudnik):
When the agent handles `ATTACH_CONTAINER_INPUT` call, it creates an HTTP 
[streaming 
connection|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3104]
 to IOSwitchboard.
 After the agent 
[sends|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3141]
 a request to IOSwitchboard, a new instance of `ConnectionProcess` is created, 
which calls 
[`ConnectionProcess::read()`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1220]
 to read an HTTP response from IOSwitchboard.
 If the socket is closed before a `\r\n\r\n` response is received, the 
`ConnectionProcess` calls 
`[disconnect()|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1326]`,
 which in turn [flushes 
`pipeline`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1197-L1201]
 containing a `Response` promise. This leads to responding back (to the 
`AttachInputToNestedContainerSession` 
[test|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/tests/api_tests.cpp#L7942-L7943])
 an `HTTP 500` error with body "Disconnected".

When io redirect finishes, IOSwitchboardServerProcess calls `terminate(self(), 
false)` (here 
[\[1\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1262]
 or there 
[\[2\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1713]).
 Then, `IOSwitchboardServerProcess::finalize()` sets a value to the 

[jira] [Comment Edited] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.

2018-08-29 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596258#comment-16596258
 ] 

Andrei Budnik edited comment on MESOS-8545 at 8/29/18 12:43 PM:


When the agent handles the `ATTACH_CONTAINER_INPUT` call, it creates an HTTP 
[streaming 
connection|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3104]
 to IOSwitchboard.
 After the agent 
[sends|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3141]
 a request to IOSwitchboard, a new instance of `ConnectionProcess` is created, 
which calls 
[`ConnectionProcess::read()`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1220]
 to read an HTTP response from IOSwitchboard.
 If the socket is closed before a `\r\n\r\n` response is received, the 
`ConnectionProcess` calls 
`[disconnect()|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1326]`,
 which in turn [flushes 
`pipeline`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1197-L1201]
 containing a `Response` promise. This results in an `HTTP 500` error with body 
"Disconnected" being returned to the `AttachInputToNestedContainerSession` 
[test|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/tests/api_tests.cpp#L7942-L7943].

When I/O redirection finishes, IOSwitchboardServerProcess calls `terminate(self(), 
false)` (here 
[\[1\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1262]
 or there 
[\[2\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1713]).
 Then, `IOSwitchboardServerProcess::finalize()` sets a value on the 
[`promise`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1304-L1308],
 which [unblocks the `main()` function|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard_main.cpp#L149-L150].
 As a result, the IOSwitchboard process terminates immediately.

When IOSwitchboard terminates, some response messages may not yet have been 
[written|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1699]
 to the socket. So if any delay occurs before the response is 
[sent|https://github.com/apache/mesos/blob/95bbe784da51b3a7eaeb9127e2541ea0b2af07b5/3rdparty/libprocess/src/http.cpp#L1742-L1748]
 back to the agent, the socket is closed by the terminating IOSwitchboard 
process. That leads to the aforementioned premature socket close in the agent.

See my previous comment which includes steps to reproduce the bug.


was (Author: abudnik):
When the agent handles `ATTACH_CONTAINER_INPUT` call, it creates an HTTP 
[streaming 
connection|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3104]
 to IOSwitchboard.
 After the agent 
[sends|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3141]
 a request to IOSwitchboard, a new instance of `ConnectionProcess` is created, 
which calls 
[`ConnectionProcess::read()`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1220]
 to read an HTTP response from IOSwitchboard.
 If the socket is closed before a `\r\n\r\n` response is received, the 
`ConnectionProcess` calls 
`[disconnect()|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1326]`,
 which in turn [flushes 
`pipeline`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1197-L1201]
 containing a `Response` promise. This leads to responding back (to the 
`AttachInputToNestedContainerSession` 
[test|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/tests/api_tests.cpp#L7942-L7943])
 an `HTTP 500` error with body "Disconnected".

When io redirect finishes, IOSwitchboardServerProcess calls `terminate(self(), 
false)` (here 
[[1]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1262]
 or there 
[[2]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1713]).
 Then, `IOSwitchboardServerProcess::finalize()` sets a value to the 

[jira] [Commented] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.

2018-08-29 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596272#comment-16596272
 ] 

Andrei Budnik commented on MESOS-8545:
--

`libprocess::finalize()` solves the problem because it waits for the termination 
of all libprocess actors (including `HttpProxy`) in 
`[ProcessManager::finalize()|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/process.cpp#L2395-L2420]`.
 This guarantees that all responses are sent back to the agent before 
IOSwitchboard returns from its `main()` function.

> AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.
> ---
>
> Key: MESOS-8545
> URL: https://issues.apache.org/jira/browse/MESOS-8545
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.5.0, 1.6.1, 1.7.0
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: Mesosphere, flaky-test
> Attachments: 
> AgentAPIStreamingTest.AttachInputToNestedContainerSession-badrun.txt, 
> AgentAPIStreamingTest.AttachInputToNestedContainerSession-badrun2.txt
>
>
> {code:java}
> I0205 17:11:01.091872 4898 http_proxy.cpp:132] Returning '500 Internal Server 
> Error' for '/slave(974)/api/v1' (Disconnected)
> /home/centos/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-centos-7/mesos/src/tests/api_tests.cpp:6596:
>  Failure
> Value of: (response).get().status
> Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.

2018-08-29 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596258#comment-16596258
 ] 

Andrei Budnik commented on MESOS-8545:
--

When the agent handles the `ATTACH_CONTAINER_INPUT` call, it creates an HTTP 
[streaming 
connection|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3104]
 to IOSwitchboard.
 After the agent 
[sends|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3141]
 a request to IOSwitchboard, a new instance of `ConnectionProcess` is created, 
which calls 
[`ConnectionProcess::read()`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1220]
 to read an HTTP response from IOSwitchboard.
 If the socket is closed before a `\r\n\r\n` response is received, the 
`ConnectionProcess` calls 
`[disconnect()|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1326]`,
 which in turn [flushes 
`pipeline`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1197-L1201]
 containing a `Response` promise. This results in an `HTTP 500` error with body 
"Disconnected" being returned to the `AttachInputToNestedContainerSession` 
[test|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/tests/api_tests.cpp#L7942-L7943].

When I/O redirection finishes, IOSwitchboardServerProcess calls `terminate(self(), 
false)` (here 
[[1]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1262]
 or there 
[[2]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1713]).
 Then, `IOSwitchboardServerProcess::finalize()` sets a value on the 
[`promise`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1304-L1308],
 which [unblocks the `main()` function|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard_main.cpp#L149-L150].
 As a result, the IOSwitchboard process terminates immediately.

When IOSwitchboard terminates, some response messages may not yet have been 
[written|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1699]
 to the socket. So if any delay occurs before the response is 
[sent|https://github.com/apache/mesos/blob/95bbe784da51b3a7eaeb9127e2541ea0b2af07b5/3rdparty/libprocess/src/http.cpp#L1742-L1748]
 back to the agent, the socket is closed by the terminating IOSwitchboard 
process. That leads to the aforementioned premature socket close in the agent.

See my previous comment, which includes steps to reproduce the bug.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.

2018-08-29 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596258#comment-16596258
 ] 

Andrei Budnik edited comment on MESOS-8545 at 8/29/18 12:20 PM:


When the agent handles `ATTACH_CONTAINER_INPUT` call, it creates an HTTP 
[streaming 
connection|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3104]
 to IOSwitchboard.
 After the agent 
[sends|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3141]
 a request to IOSwitchboard, a new instance of `ConnectionProcess` is created, 
which calls 
[`ConnectionProcess::read()`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1220]
 to read an HTTP response from IOSwitchboard.
 If the socket is closed before a `\r\n\r\n` response is received, the 
`ConnectionProcess` calls 
`[disconnect()|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1326]`,
 which in turn [flushes the 
`pipeline`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1197-L1201]
 containing a `Response` promise. As a result, an `HTTP 500` error with body 
"Disconnected" is sent back to the `AttachInputToNestedContainerSession` 
[test|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/tests/api_tests.cpp#L7942-L7943].

When io redirect finishes, IOSwitchboardServerProcess calls `terminate(self(), 
false)` (here 
[\[1\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1262]
 or there 
[\[2\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1713]).
 Then, `IOSwitchboardServerProcess::finalize()` sets a value to the 
[`promise`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1304-L1308],
 which [unblocks 
`main()`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard_main.cpp#L149-L150]
 function. As a result, the IOSwitchboard process terminates immediately.

When IOSwitchboard terminates, response messages might not yet have been 
[written|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1699]
 to the socket. So, if any delay occurs before 
[sending|https://github.com/apache/mesos/blob/95bbe784da51b3a7eaeb9127e2541ea0b2af07b5/3rdparty/libprocess/src/http.cpp#L1742-L1748]
 the response back to the agent, the socket will be closed due to the 
IOSwitchboard process termination. That leads to the aforementioned premature 
socket close in the agent.

See my previous comment including steps to reproduce.


[jira] [Assigned] (MESOS-4233) Logging is too verbose for sysadmins / syslog

2018-08-29 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-4233:
--

Assignee: (was: Kapil Arya)

> Logging is too verbose for sysadmins / syslog
> -
>
> Key: MESOS-4233
> URL: https://issues.apache.org/jira/browse/MESOS-4233
> Project: Mesos
>  Issue Type: Epic
>Reporter: Cody Maloney
>Priority: Major
>  Labels: mesosphere
> Attachments: giant_port_range_logging
>
>
> Currently mesos logs a lot. When launching a thousand tasks in the space of 
> 10 seconds it will print tens of thousands of log lines, overwhelming syslog 
> (there is a max rate at which a process can send stuff over a unix socket) 
> and not giving useful information to a sysadmin who cares about just the 
> high-level activity and when something goes wrong.
> Note mesos also blocks writing to its log locations, so when writing a lot of 
> log messages, it can fill up the write buffer in the kernel, and be suspended 
> until the syslog agent catches up reading from the socket (GLOG does a 
> blocking fwrite to stderr). GLOG also has a big mutex around logging so only 
> one thing logs at a time.
> While for "internal debugging" it is useful to see things like "message went 
> from internal compoent x to internal component y", from a sysadmin 
> perspective I only care about the high level actions taken (launched task for 
> framework x), sent offer to framework y, got task failed from host z. Note 
> those are what I'd expect at the "INFO" level. At the "WARNING" level I'd 
> expect very little to be logged / almost nothing in normal operation. Just 
> things like "WARN: Repliacted log write took longer than expected". WARN 
> would also get things like backtraces on crashes and abnormal exits / abort.
> When trying to launch 3k+ tasks inside a second, mesos logging currently 
> overwhelms syslog with 100k+ messages, many of which are thousands of bytes. 
> Sysadmins expect to be able to use syslog to monitor basic events in their 
> system. This is too much.
> We can keep logging the messages to files, but the logging to stderr needs to 
> be reduced significantly (stderr gets picked up and forwarded to syslog / 
> central aggregation).
> What I would like is if I can set the stderr logging level to be different / 
> independent from the file logging level (Syslog giving the "sysadmin" 
> aggregated overview, files useful for debugging in depth what happened in a 
> cluster). A lot of what mesos currently logs at info is really debugging info 
> / should show up as debug log level.
> Some samples of mesos logging a lot more than a sysadmin would want / expect 
> are attached, and some are below:
>  - Every task gets printed multiple times for a basic launch:
> {noformat}
> Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: 
> I1215 22:58:29.382644  1315 master.cpp:3248] Launching task 
> envy.5b19a713-a37f-11e5-8b3e-0251692d6109 of framework 
> 5178f46d-71d6-422f-922c-5bbe82dff9cc- (marathon)
> Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: 
> I1215 22:58:29.382925  1315 master.hpp:176] Adding task 
> envy.5b1958f2-a37f-11e5-8b3e-0251692d6109 with resources cpus(​*):0.0001; 
> mem(*​):16; ports(*):[14047-14047]
> {noformat}
>  - Every task status update prints many log lines, successful ones are part 
> of normal operation and maybe should be logged at info / debug levels, but 
> not to a sysadmin (Just show when things fail, and maybe aggregate counters 
> to tell of the volume of working)
>  - No log message should be really big / more than 1k characters (would 
> prevent the giant port list attached, and make such cases easily 
> discoverable / bug filable / fixable) 





[jira] [Commented] (MESOS-9189) Include 'Connection: close' header in streaming API responses.

2018-08-29 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596183#comment-16596183
 ] 

Alexander Rukletsov commented on MESOS-9189:


I'm not sure I understand how the change is supposed to help. {{'Connection: 
close'}} set by a server is an indicator for the client to close the connection 
_after_ receiving the complete response. AFAIK, we don't ever complete the 
streaming response in Mesos, and there is no way for Mesos to somehow 
understand that an end client might not be interested in the stream any more 
and send an empty chunk. From a middleman's point of view, the actual value of 
the {{'Connection'}} header is only interesting _after_ the response is 
completed, i.e., after an empty chunk has been received, which, IIRC, never 
happens in our case.

Is the hope here that some middlemen peek into the {{'Connection'}} header 
and, based on it, decide whether to close the connection themselves when their 
client disconnects, even though the response might not be completed?

> Include 'Connection: close' header in streaming API responses.
> --
>
> Key: MESOS-9189
> URL: https://issues.apache.org/jira/browse/MESOS-9189
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>
> We've seen some HTTP intermediaries (e.g. ELB) decide to re-use connections 
> to mesos as an optimization to avoid re-connection overhead. As a result, 
> when the end-client of the streaming API disconnects from the intermediary, 
> the intermediary leaves the connection to mesos open in an attempt to re-use 
> the connection for another request once the response completes. Mesos then 
> thinks that the subscriber never disconnected and the intermediary happily 
> continues to read the streaming events even though there's no end-client.
> To help indicate to intermediaries that the connection SHOULD NOT be re-used, 
> we can set the 'Connection: close' header for streaming API responses. It may 
> not be respected (since the language seems to be SHOULD NOT), but some 
> intermediaries may respect it and close the connection if the end-client 
> disconnects.
> Note that libprocess' http server currently doesn't close the connection 
> based on a handler setting this header, but it doesn't matter here since the 
> streaming API responses are infinite.





[jira] [Assigned] (MESOS-8957) Install Python 3 on Mesos CI instances

2018-08-29 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/MESOS-8957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Gögge reassigned MESOS-8957:
--

Assignee: Robin Gögge  (was: Armand Grillet)

> Install Python 3 on Mesos CI instances
> --
>
> Key: MESOS-8957
> URL: https://issues.apache.org/jira/browse/MESOS-8957
> Project: Mesos
>  Issue Type: Task
>Reporter: Armand Grillet
>Assignee: Robin Gögge
>Priority: Major
>






[jira] [Commented] (MESOS-7076) libprocess tests fail when using libevent 2.1.8

2018-08-29 Thread Alexander Rojas (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596019#comment-16596019
 ] 

Alexander Rojas commented on MESOS-7076:


[~tillt] if you check the mailing list conversation I had with the libevent 
folks, they are willing to help us debug the issue. What they request is a 
container with a debug version of libevent and a compiled Mesos test that they 
can use to debug. I was timeboxed when I worked on this, so I had to stop, but 
perhaps we can try to get their help again?

> libprocess tests fail when using libevent 2.1.8
> ---
>
> Key: MESOS-7076
> URL: https://issues.apache.org/jira/browse/MESOS-7076
> Project: Mesos
>  Issue Type: Bug
>  Components: build, libprocess, test
> Environment: macOS 10.12.3, libevent 2.1.8 (installed via Homebrew)
>Reporter: Jan Schlicht
>Assignee: Till Toenshoff
>Priority: Critical
>  Labels: ci
>
> Running {{libprocess-tests}} on Mesos compiled with {{--enable-libevent 
> --enable-ssl}} on an operating system using libevent 2.1.8, SSL related tests 
> fail like
> {noformat}
> [ RUN  ] SSLTest.SSLSocket
> I0207 15:20:46.017881 2528580544 openssl.cpp:419] CA file path is 
> unspecified! NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0207 15:20:46.017904 2528580544 openssl.cpp:424] CA directory path 
> unspecified! NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0207 15:20:46.017918 2528580544 openssl.cpp:429] Will not verify peer 
> certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0207 15:20:46.017923 2528580544 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0207 15:20:46.033001 2528580544 openssl.cpp:419] CA file path is 
> unspecified! NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0207 15:20:46.033179 2528580544 openssl.cpp:424] CA directory path 
> unspecified! NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0207 15:20:46.033196 2528580544 openssl.cpp:429] Will not verify peer 
> certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0207 15:20:46.033201 2528580544 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> ../../../3rdparty/libprocess/src/tests/ssl_tests.cpp:257: Failure
> Failed to wait 15secs for Socket(socket.get()).recv()
> [  FAILED  ] SSLTest.SSLSocket (15196 ms)
> {noformat}
> Tests failing are
> {noformat}
> SSLTest.SSLSocket
> SSLTest.NoVerifyBadCA
> SSLTest.VerifyCertificate
> SSLTest.ProtocolMismatch
> SSLTest.ECDHESupport
> SSLTest.PeerAddress
> SSLTest.HTTPSGet
> SSLTest.HTTPSPost
> SSLTest.SilentSocket
> SSLTest.ShutdownThenSend
> SSLVerifyIPAdd/SSLTest.BasicSameProcess/0, where GetParam() = "false"
> SSLVerifyIPAdd/SSLTest.BasicSameProcess/1, where GetParam() = "true"
> SSLVerifyIPAdd/SSLTest.BasicSameProcessUnix/0, where GetParam() = "false"
> SSLVerifyIPAdd/SSLTest.BasicSameProcessUnix/1, where GetParam() = "true"
> SSLVerifyIPAdd/SSLTest.RequireCertificate/0, where GetParam() = "false"
> SSLVerifyIPAdd/SSLTest.RequireCertificate/1, where GetParam() = "true"
> {noformat}





[jira] [Created] (MESOS-9191) Docker command executor may get stuck in an infinite unkillable loop.

2018-08-29 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9191:
---

 Summary: Docker command executor may get stuck in an infinite 
unkillable loop.
 Key: MESOS-9191
 URL: https://issues.apache.org/jira/browse/MESOS-9191
 Project: Mesos
  Issue Type: Bug
  Components: containerization, docker
Reporter: Gilbert Song


Due to the change from https://issues.apache.org/jira/browse/MESOS-8574, the 
behavior of the docker command executor when discarding the future of `docker 
stop` was changed: if a new killTask() is invoked while an existing `docker 
stop` is pending, the old one is discarded and the new one is executed. This 
is fine for most cases.

However, `docker stop` can take a long time (depending on the grace period and 
whether the application handles SIGTERM). If the framework retries killTask 
more frequently than the grace period (which depends on the kill policy API, 
env var, or agent flags), the executor may be stuck forever with unkillable 
tasks, because every time, before the `docker stop` finishes, its future is 
discarded by the new incoming killTask.

We should consider re-using the grace period before calling discard() on a 
pending `docker stop` future.


