[jira] [Commented] (MESOS-9348) URL-encoded HDFS artifacts can't be fetched through the cache.

2018-10-22 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16659865#comment-16659865
 ] 

James Peach commented on MESOS-9348:


One approach here is to URL-encode the output filename for the HDFS command. 
Experimentally, it looks like this is required, since the command errors out on 
unsafe characters:

{noformat}
# hdfs dfs -copyToLocal \
    hdfs:///artifacts/8c/99/4b/8c994b489674589a58805e2e695e98674b9dd793411579f0fbaea3459f94f86e/connector/%5BRELEASE%5D/connector-%5BRELEASE%5D.jar \
    $(pwd)/%255B-jpeach-].jar
copyToLocal: unexpected URISyntaxException
{noformat}
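
A minimal sketch of the encoding step suggested above, assuming that `hdfs dfs` applies exactly one round of percent-decoding to the local target path (an inference from the behaviour reported in this issue, not a documented guarantee). The helper name `encode_output_name` is hypothetical and not part of Mesos:

```python
from urllib.parse import quote, unquote


def encode_output_name(basename: str) -> str:
    """Percent-encode every reserved character in a file name.

    Hypothetical helper: with safe="" even '%', '[' and ']' are encoded,
    so a single round of decoding yields the original name again.
    """
    return quote(basename, safe="")


# The name the fetcher cache expects to find on disk.
cache_name = "%5B-jpeach-%5D.jar"

# The name that would be passed to the copy command instead.
target = encode_output_name(cache_name)
print(target)

# One round of decoding restores the expected cache name.
assert unquote(target) == cache_name
```

Note that the failing command above encodes only the leading bracket and leaves `]` raw; encoding the whole name avoids passing any unsafe character.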

> URL-encoded HDFS artifacts can't be fetched through the cache.
> --
>
> Key: MESOS-9348
> URL: https://issues.apache.org/jira/browse/MESOS-9348
> Project: Mesos
> Issue Type: Bug
> Components: fetcher
> Reporter: James Peach
> Priority: Major
>
> The {{hdfs dfs}} command always does a URI decode on the target output file. 
> This means that the output file gets stored in the fetcher cache under the 
> wrong filename and we can never retrieve it.
> Here's an example of how the command behaves:
> {noformat}
> [/tmp]# hdfs dfs -copyToLocal \
>     hdfs:///artifacts/8c/99/4b/8c994b489674589a58805e2e695e98674b9dd793411579f0fbaea3459f94f86e/connector/%5BRELEASE%5D/connector-%5BRELEASE%5D.jar \
>     $(pwd)/%5B-jpeach-%5D.jar
> [/tmp]# ls -l *jpeach*
> -rw-r--r-- 1 root root 7285799 Oct 22 23:29 [-jpeach-].jar
> {noformat}
> Here's how this plays out in the fetcher:
> {noformat}
> W1022 23:22:13.649587 3186459 fetcher.cpp:395] Copying instead of extracting 
> resource from URI with 'extract' flag, because it does not seem to be an 
> archive: 
> hdfs:///artifacts/8c/99/4b/8c994b489674589a58805e2e695e98674b9dd793411579f0fbaea3459f94f86e/connector/%5BRELEASE%5D/connector-%5BRELEASE%5D.jar
> cp: cannot stat `/srv/mesos/fetch/jarvis/c67-connector-_ASE%5D.jar': No such 
> file or directory
> E1022 23:22:13.652987 3186459 fetcher.cpp:613] EXIT with status 1: Failed to 
> fetch 
> 'hdfs:///artifacts/8c/99/4b/8c994b489674589a58805e2e695e98674b9dd793411579f0fbaea3459f94f86e/connector/%5BRELEASE%5D/connector-%5BRELEASE%5D.jar':
>  cp failed with status: 256
> ...
> # ls -latr /srv/mesos/fetch
> ...
> -rw-r--r-- 1 jarvis jarvis   7285799 Oct 22 23:22 c67-connector-_ASE].jar
> {noformat}
> The fetcher has downloaded the artifact into the cache, but can't copy it 
> into the sandbox because it was downloaded to the wrong filename.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9348) URL-encoded HDFS artifacts can't be fetched through the cache.

2018-10-22 Thread James Peach (JIRA)
James Peach created MESOS-9348:
--

Summary: URL-encoded HDFS artifacts can't be fetched through the cache.
Key: MESOS-9348
URL: https://issues.apache.org/jira/browse/MESOS-9348
Project: Mesos
Issue Type: Bug
Components: fetcher
Reporter: James Peach


The {{hdfs dfs}} command always does a URI decode on the target output file. 
This means that the output file gets stored in the fetcher cache under the 
wrong filename and we can never retrieve it.

Here's an example of how the command behaves:
{noformat}
[/tmp]# hdfs dfs -copyToLocal \
    hdfs:///artifacts/8c/99/4b/8c994b489674589a58805e2e695e98674b9dd793411579f0fbaea3459f94f86e/connector/%5BRELEASE%5D/connector-%5BRELEASE%5D.jar \
    $(pwd)/%5B-jpeach-%5D.jar

[/tmp]# ls -l *jpeach*
-rw-r--r-- 1 root root 7285799 Oct 22 23:29 [-jpeach-].jar
{noformat}

Here's how this plays out in the fetcher:
{noformat}
W1022 23:22:13.649587 3186459 fetcher.cpp:395] Copying instead of extracting 
resource from URI with 'extract' flag, because it does not seem to be an 
archive: 
hdfs:///artifacts/8c/99/4b/8c994b489674589a58805e2e695e98674b9dd793411579f0fbaea3459f94f86e/connector/%5BRELEASE%5D/connector-%5BRELEASE%5D.jar
cp: cannot stat `/srv/mesos/fetch/jarvis/c67-connector-_ASE%5D.jar': No such 
file or directory
E1022 23:22:13.652987 3186459 fetcher.cpp:613] EXIT with status 1: Failed to 
fetch 
'hdfs:///artifacts/8c/99/4b/8c994b489674589a58805e2e695e98674b9dd793411579f0fbaea3459f94f86e/connector/%5BRELEASE%5D/connector-%5BRELEASE%5D.jar':
 cp failed with status: 256
...
# ls -latr /srv/mesos/fetch
...
-rw-r--r-- 1 jarvis jarvis   7285799 Oct 22 23:22 c67-connector-_ASE].jar
{noformat}

The fetcher has downloaded the artifact into the cache, but can't copy it into 
the sandbox because it was downloaded to the wrong filename.
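
The mismatch can be reproduced without HDFS. The following sketch assumes the download step behaves like a single percent-decode of the target name, as the transcript above suggests; `name_written_by_download` is a hypothetical stand-in, not Mesos or Hadoop code:

```python
from urllib.parse import unquote


def name_written_by_download(requested_name: str) -> str:
    """Hypothetical model of the observed behaviour: the target is decoded."""
    return unquote(requested_name)


# Cache file name the fetcher later tries to copy into the sandbox.
expected = "c67-connector-_ASE%5D.jar"

# File name that actually appears in the cache directory.
actual = name_written_by_download(expected)

print(f"expected: {expected}")
print(f"actual:   {actual}")
print(f"lookup succeeds: {expected == actual}")
```

The two names differ only in the decoded `%5D`, which matches the `cp: cannot stat` error and the `ls` output in the fetcher log.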





[jira] [Commented] (MESOS-8335) ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/2 fails on Debian 9 and CentOS 6.

2018-10-22 Thread Till Toenshoff (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16659748#comment-16659748
 ] 

Till Toenshoff commented on MESOS-8335:
---

Removed the chosen workaround from internal CI after realising that it was 
never put into our mesos-build images. Closing this again; nothing new to see 
here.

> ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/2 fails on Debian 9  
> and CentOS 6.
> -
>
> Key: MESOS-8335
> URL: https://issues.apache.org/jira/browse/MESOS-8335
> Project: Mesos
> Issue Type: Bug
> Components: test
> Reporter: Armand Grillet
> Priority: Major
> Attachments: centos-6-curl-7.19.7.txt, centos-6-curl-7.57.txt
>
>
> Version of Docker used: Docker version 17.11.0-ce, build 1caf76c
> Version of Curl used: curl 7.52.1 (x86_64-pc-linux-gnu) libcurl/7.52.1 
> OpenSSL/1.0.2l zlib/1.2.8 libidn2/0.16 libpsl/0.17.0 (+libidn2/0.16) 
> libssh2/1.7.0 nghttp2/1.18.1 librtmp/2.3
> Error:
> {code}
> [ RUN  ] 
> ImageAlpine/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/2
> I1215 00:09:28.694677 19343 cluster.cpp:172] Creating default 'local' 
> authorizer
> I1215 00:09:28.697144 30867 master.cpp:456] Master 
> 75b48a47-7b6b-4e60-82d3-dfdc0cf8bff3 (ip-172-16-10-160.ec2.internal) started 
> on 127.0.1.1:35029
> I1215 00:09:28.697163 30867 master.cpp:458] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/4RYdF1/credentials" 
> --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/4RYdF1/master" 
> --zk_session_timeout="10secs"
> I1215 00:09:28.697413 30867 master.cpp:507] Master only allowing 
> authenticated frameworks to register
> I1215 00:09:28.697422 30867 master.cpp:513] Master only allowing 
> authenticated agents to register
> I1215 00:09:28.697427 30867 master.cpp:519] Master only allowing 
> authenticated HTTP frameworks to register
> I1215 00:09:28.697433 30867 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/4RYdF1/credentials'
> I1215 00:09:28.697654 30867 master.cpp:563] Using default 'crammd5' 
> authenticator
> I1215 00:09:28.697806 30867 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1215 00:09:28.697962 30867 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1215 00:09:28.698076 30867 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1215 00:09:28.698194 30867 master.cpp:642] Authorization enabled
> I1215 00:09:28.698468 30864 hierarchical.cpp:175] Initialized hierarchical 
> allocator process
> I1215 00:09:28.698563 30864 whitelist_watcher.cpp:77] No whitelist given
> I1215 00:09:28.701695 30871 master.cpp:2209] Elected as the leading master!
> I1215 00:09:28.701723 30871 master.cpp:1689] Recovering from registrar
> I1215 00:09:28.701859 30869 registrar.cpp:347] Recovering registrar
> I1215 00:09:28.702401 30869 registrar.cpp:391] Successfully fetched the 
> registry (0B) in 507904ns
> I1215 00:09:28.702495 30869 registrar.cpp:495] Applied 1 operations in 
> 28977ns; attempting to update the registry
> I1215 00:09:28.702997 30869 registrar.cpp:552] Successfully updated the 
> registry in 464896ns
> I1215 00:09:28.703086 30869 registrar.cpp:424] Successfully recovered 
> registrar
> I1215 00:09:28.703640 30865 master.cpp:1802] Recovered 0 agents from the 
> registry (167B); allowing 10mins for agents to re-register
> I1215 00:09:28.703661 30869 hierarchical.cpp:213] Skipping recovery of 
> hierarchical allocator: nothing to recover
> W1215 00:09:28.706816 19343 

[jira] [Commented] (MESOS-9176) Mesos does not work properly on modern Ubuntu distributions.

2018-10-22 Thread Till Toenshoff (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16659732#comment-16659732
 ] 

Till Toenshoff commented on MESOS-9176:
---

MESOS-8368 should not be strictly needed for supporting modern Linux 
distributions, so I am removing it from this epic.

> Mesos does not work properly on modern Ubuntu distributions.
> 
>
> Key: MESOS-9176
> URL: https://issues.apache.org/jira/browse/MESOS-9176
> Project: Mesos
> Issue Type: Epic
> Affects Versions: 1.7.0
> Environment: Ubuntu 17.10, Ubuntu 18.04
> Reporter: Alexander Rukletsov
> Priority: Major
> Labels: integration, mesosphere
>
> We have observed several issues in various components on modern Ubuntus, 
> e.g., 17.10, 18.04. Needless to say, we need to ensure Mesos compiles and 
> runs fine on those distros.





[jira] [Commented] (MESOS-8907) Docker image fetcher fails with HTTP/2.

2018-10-22 Thread Till Toenshoff (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16659709#comment-16659709
 ] 

Till Toenshoff commented on MESOS-8907:
---

{noformat}
commit bc55489a18fdaf6c79945704d5c70984cab87c11
Author: Till Toenshoff 
Date:   Mon Oct 22 21:42:53 2018 +0200

Updated docker image fetcher to enforce HTTP 1.1 where needed.

Modifies the 'curl' invocation that is returning an http::Response,
locking it into HTTP 1.1. Our current HTTP parser is unable to process
HTTP 2 responses.

With the advent of curl 7.47, HTTPS connections are being enforced
towards HTTP 2 rather aggressively. As a result, our image fetcher
fails when recent curl versions are being used for pulling images from
a registry that supports HTTP 2.

HTTP 1.1 is chosen as long as the underlying curl supports the
'--http1.1' flag. If curl is old enough to not support that flag, we
can deduce that it will not enforce HTTP 2 and therefore needs no
further action.

For allowing all the benefits of HTTP 2 where possible, we do not
adapt any 'curl' invocations that do not attempt to parse headers.

Review: https://reviews.apache.org/r/69075/
{noformat}
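
The version check described in the commit message can be sketched as follows. This is an illustration, not the code from the review: the helper name `curl_args` and the version-tuple input are assumptions, the remaining flags are copied from the `curl` invocation quoted in the issue description, and the 7.33.0 threshold is the release in which the curl manual says `--http1.1` was added.

```python
from typing import List, Tuple

# Per the curl manual, `--http1.1` was added in curl 7.33.0.
HTTP1_1_FLAG_SINCE = (7, 33, 0)


def curl_args(url: str, curl_version: Tuple[int, int, int]) -> List[str]:
    """Build a curl command line whose response headers will be parsed.

    Hypothetical helper: pins the request to HTTP/1.1 only when the
    installed curl is new enough to understand the flag.
    """
    args = ["curl", "-i", "--raw", "-L", "-s", "-S"]
    if curl_version >= HTTP1_1_FLAG_SINCE:
        args.append("--http1.1")
    args += ["-o", "-", url]
    return args


# A recent curl gets the flag, a very old one does not.
print(curl_args("https://quay.io/coreos/alpine-sh", (7, 52, 1)))
print(curl_args("https://quay.io/coreos/alpine-sh", (7, 19, 7)))
```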

> Docker image fetcher fails with HTTP/2.
> ---
>
> Key: MESOS-8907
> URL: https://issues.apache.org/jira/browse/MESOS-8907
> Project: Mesos
> Issue Type: Task
> Components: fetcher
> Affects Versions: 1.5.1, 1.6.1, 1.7.0, 1.8.0
> Reporter: James Peach
> Assignee: Till Toenshoff
> Priority: Major
> Labels: integration
> Fix For: 1.6.2, 1.7.1, 1.5.3
>
>
> {noformat}
> [ RUN  ] 
> ImageAlpine/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/2
> ...
> I0510 20:52:00.209815 25010 registry_puller.cpp:287] Pulling image 
> 'quay.io/coreos/alpine-sh' from 
> 'docker-manifest://quay.iocoreos/alpine-sh?latest#https' to 
> '/tmp/ImageAlpine_ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_2_wF7EfM/store/docker/staging/qit1Jn'
> E0510 20:52:00.756072 25003 slave.cpp:6176] Container 
> '5eb869c5-555c-4dc9-a6ce-ddc2e7dbd01a' for executor 
> 'ad9aa898-026e-47d8-bac6-0ff993ec5904' of framework 
> 7dbe7cd6-8ffe-4bcf-986a-17ba677b5a69- failed to start: Failed to decode 
> HTTP responses: Decoding failed
> HTTP/2 200
> server: nginx/1.13.12
> date: Fri, 11 May 2018 03:52:00 GMT
> content-type: application/vnd.docker.distribution.manifest.v1+prettyjws
> content-length: 4486
> docker-content-digest: 
> sha256:61bd5317a92c3213cfe70e2b629098c51c50728ef48ff984ce929983889ed663
> x-frame-options: DENY
> strict-transport-security: max-age=63072000; preload
> ...
> {noformat}
> Note that curl is saying the HTTP version is "HTTP/2". This happens on modern 
> curl that automatically negotiates HTTP/2, but the docker fetcher isn't 
> prepared to parse that.
> {noformat}
> $ curl -i --raw -L -s -S -o -  'http://quay.io/coreos/alpine-sh?latest#https'
> HTTP/1.1 301 Moved Permanently
> Content-Type: text/html
> Date: Fri, 11 May 2018 04:07:44 GMT
> Location: https://quay.io/coreos/alpine-sh?latest
> Server: nginx/1.13.12
> Content-Length: 186
> Connection: keep-alive
> HTTP/2 301
> server: nginx/1.13.12
> date: Fri, 11 May 2018 04:07:45 GMT
> content-type: text/html; charset=utf-8
> content-length: 287
> location: https://quay.io/coreos/alpine-sh/?latest
> x-frame-options: DENY
> strict-transport-security: max-age=63072000; preload
> {noformat}





[jira] [Created] (MESOS-9347) Add test abstraction for the master operator stream

2018-10-22 Thread Greg Mann (JIRA)
Greg Mann created MESOS-9347:


Summary: Add test abstraction for the master operator stream
Key: MESOS-9347
URL: https://issues.apache.org/jira/browse/MESOS-9347
Project: Mesos
Issue Type: Improvement
Reporter: Greg Mann


Adding a test abstraction around the master operator API would make it easier 
for developers to test this functionality, increasing the likelihood that it 
will be included in new test code going forward.

For example, I can imagine an RAII-style construct which creates an operator 
event stream upon instantiation and captures the initial GET_STATE response, 
exposing a method to allow callers to pop events off of the stream 
subsequently. This could be used in the existing {{MasterAPITest.Subscribe}}, 
as well as other tests in the future.
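
As a language-neutral illustration of that idea, here is a minimal Python sketch in which a context manager plays the role of the RAII construct. Everything in it is hypothetical: `OperatorStream`, `pop_event`, and the in-memory event source stand in for a real subscription to the master operator API, and the dictionaries are placeholders rather than actual operator API messages.

```python
from collections import deque
from typing import Any, Deque, Dict, Iterable

Event = Dict[str, Any]


class OperatorStream:
    """Hypothetical test helper that owns an operator event stream."""

    def __init__(self, source: Iterable[Event]) -> None:
        self._source = source
        self._pending: Deque[Event] = deque()
        self.initial_state: Event = {}

    def __enter__(self) -> "OperatorStream":
        events = iter(self._source)
        # Treat the first message as the initial state snapshot.
        self.initial_state = next(events)
        self._pending.extend(events)
        return self

    def __exit__(self, exc_type, exc, traceback) -> None:
        # Release the stream when the test scope ends.
        self._pending.clear()

    def pop_event(self) -> Event:
        """Return the oldest event that has not been consumed yet."""
        return self._pending.popleft()


# Placeholder messages; a real test would subscribe to a running master.
fake_stream = [{"type": "INITIAL_STATE"}, {"type": "EXAMPLE_EVENT"}]

with OperatorStream(fake_stream) as stream:
    print(stream.initial_state)
    print(stream.pop_event())
```

Tying setup and teardown to a scope keeps individual tests free of stream bookkeeping, which is the main benefit the proposal is after.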





[jira] [Commented] (MESOS-8907) curl fetcher fails with HTTP/2

2018-10-22 Thread Till Toenshoff (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16659211#comment-16659211
 ] 

Till Toenshoff commented on MESOS-8907:
---

[~Kirill P] the problem is our current HTTP parser. We only hit the described 
bug when we feed the parser a response that is in fact HTTP/2. We therefore 
need to fix all HTTP client invocations that feed their HTTP responses into 
the HTTP parser.
The docker fetcher, which does this, does not use libcurl but instead calls 
out to the {{curl}} command line tool.
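
To make the failure mode concrete, here is a toy status-line check. It is not the Mesos HTTP parser; it only assumes, for illustration, a parser that accepts the HTTP/1.x status-line shape (`HTTP/1.<minor> <code> <reason>`), which the `HTTP/2 301` line from the issue description does not have.

```python
import re
from typing import Optional

# Toy pattern for an HTTP/1.0 or HTTP/1.1 status line.
STATUS_LINE = re.compile(r"^HTTP/1\.[01] (\d{3}) (.*)$")


def parse_status(line: str) -> Optional[int]:
    """Return the status code, or None if the line is not HTTP/1.x."""
    match = STATUS_LINE.match(line)
    return int(match.group(1)) if match else None


# Both status lines appear in the curl output quoted in the issue.
print(parse_status("HTTP/1.1 301 Moved Permanently"))
print(parse_status("HTTP/2 301"))
```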
 



> curl fetcher fails with HTTP/2
> --
>
> Key: MESOS-8907
> URL: https://issues.apache.org/jira/browse/MESOS-8907
> Project: Mesos
> Issue Type: Task
> Components: fetcher
> Affects Versions: 1.5.1, 1.6.1, 1.7.0, 1.8.0
> Reporter: James Peach
> Assignee: Till Toenshoff
> Priority: Major
> Labels: integration
>
> {noformat}
> [ RUN  ] 
> ImageAlpine/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/2
> ...
> I0510 20:52:00.209815 25010 registry_puller.cpp:287] Pulling image 
> 'quay.io/coreos/alpine-sh' from 
> 'docker-manifest://quay.iocoreos/alpine-sh?latest#https' to 
> '/tmp/ImageAlpine_ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_2_wF7EfM/store/docker/staging/qit1Jn'
> E0510 20:52:00.756072 25003 slave.cpp:6176] Container 
> '5eb869c5-555c-4dc9-a6ce-ddc2e7dbd01a' for executor 
> 'ad9aa898-026e-47d8-bac6-0ff993ec5904' of framework 
> 7dbe7cd6-8ffe-4bcf-986a-17ba677b5a69- failed to start: Failed to decode 
> HTTP responses: Decoding failed
> HTTP/2 200
> server: nginx/1.13.12
> date: Fri, 11 May 2018 03:52:00 GMT
> content-type: application/vnd.docker.distribution.manifest.v1+prettyjws
> content-length: 4486
> docker-content-digest: 
> sha256:61bd5317a92c3213cfe70e2b629098c51c50728ef48ff984ce929983889ed663
> x-frame-options: DENY
> strict-transport-security: max-age=63072000; preload
> ...
> {noformat}
> Note that curl is saying the HTTP version is "HTTP/2". This happens on modern 
> curl that automatically negotiates HTTP/2, but the docker fetcher isn't 
> prepared to parse that.
> {noformat}
> $ curl -i --raw -L -s -S -o -  'http://quay.io/coreos/alpine-sh?latest#https'
> HTTP/1.1 301 Moved Permanently
> Content-Type: text/html
> Date: Fri, 11 May 2018 04:07:44 GMT
> Location: https://quay.io/coreos/alpine-sh?latest
> Server: nginx/1.13.12
> Content-Length: 186
> Connection: keep-alive
> HTTP/2 301
> server: nginx/1.13.12
> date: Fri, 11 May 2018 04:07:45 GMT
> content-type: text/html; charset=utf-8
> content-length: 287
> location: https://quay.io/coreos/alpine-sh/?latest
> x-frame-options: DENY
> strict-transport-security: max-age=63072000; preload
> {noformat}





[jira] [Assigned] (MESOS-9334) Container stuck at ISOLATING state due to libevent poll never returns

2018-10-22 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-9334:
-

Shepherd: Gilbert Song
Assignee: Qian Zhang
Sprint: Mesosphere RI-6 Sprint 2018-31
Story Points: 5

> Container stuck at ISOLATING state due to libevent poll never returns
> -
>
> Key: MESOS-9334
> URL: https://issues.apache.org/jira/browse/MESOS-9334
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Reporter: Qian Zhang
> Assignee: Qian Zhang
> Priority: Critical
>
> We found that a UCR container may get stuck in the `ISOLATING` state:
> {code:java}
> 2018-10-03 09:13:23: I1003 09:13:23.274561 2355 containerizer.cpp:3122] 
> Transitioning the state of container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 
> from PREPARING to ISOLATING
> 2018-10-03 09:13:23: I1003 09:13:23.279223 2354 cni.cpp:962] Bind mounted 
> '/proc/5244/ns/net' to 
> '/run/mesos/isolators/network/cni/1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54/ns' 
> for container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54
> 2018-10-03 09:23:22: I1003 09:23:22.879868 2354 containerizer.cpp:2459] 
> Destroying container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 in ISOLATING state
> {code}
> In the above logs, the state of container 
> `1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54` transitioned to `ISOLATING` at 
> 09:13:23, but did not transition to any other state until it was destroyed 
> due to the executor registration timeout (10 mins). The destroy can never 
> complete, since it has to wait for the container to finish isolating.





[jira] [Created] (MESOS-9346) Support for Block I/O latency cgroup controller

2018-10-22 Thread Simao Reis (JIRA)
Simao Reis created MESOS-9346:
-

Summary: Support for Block I/O latency cgroup controller
Key: MESOS-9346
URL: https://issues.apache.org/jira/browse/MESOS-9346
Project: Mesos
Issue Type: Wish
Reporter: Simao Reis


The Linux kernel 4.19 release adds a new controller that attempts to guarantee 
minimum I/O latency targets for cgroups.

As long as everybody is meeting their latency target the controller doesn't do 
anything, but once a group starts missing its target it will attempt to 
maintain average IO latencies below the configured latency target, throttling 
anybody with a higher latency target than the victimized group. For more 
details see the 
[documentation|https://git.kernel.org/linus/b351f0c76c3eb94c9ccfb68d0b23899a35e47f27].

This would be useful, for example, to protect etcd workloads running in a Mesos 
UCR container. Please refer to the Disk section of the etcd 
[tuning guide|https://coreos.com/etcd/docs/latest/tuning.html] for more information.
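
For reference, the kernel documentation describes the cgroup v2 interface file `io.latency` as taking entries of the form `MAJOR:MINOR target=<microseconds>`. The sketch below only formats such an entry; the helper name, the device numbers, and the target value are illustrative assumptions, and nothing here reflects an existing Mesos isolator.

```python
def io_latency_entry(major: int, minor: int, target_us: int) -> str:
    """Format one `io.latency` entry for a block device.

    Hypothetical helper: `major` and `minor` identify the device and
    `target_us` is the latency target in microseconds.
    """
    if min(major, minor, target_us) < 0:
        raise ValueError("device numbers and target must be non-negative")
    return f"{major}:{minor} target={target_us}"


# Example values only: device 8:0 with a 10 ms latency target.
print(io_latency_entry(8, 0, 10_000))
```

An isolator would presumably write such a line into the container's cgroup directory, but how that maps onto Mesos resources is exactly what this ticket leaves open.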





[jira] [Commented] (MESOS-9341) Add non-interactive test(s) for `mesos task exec`

2018-10-22 Thread Kevin Klues (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16659030#comment-16659030
 ] 

Kevin Klues commented on MESOS-9341:


{noformat}
commit 06e5a3abe0f6241a06ca89ac6071b58556a9f07b
Author: Armand Grillet 
Date:   Mon Oct 22 10:18:29 2018 -0400

Added non-interactive test for 'task exec'.

Review: https://reviews.apache.org/r/69049/
{noformat}

> Add non-interactive test(s) for `mesos task exec`
> -
>
> Key: MESOS-9341
> URL: https://issues.apache.org/jira/browse/MESOS-9341
> Project: Mesos
> Issue Type: Task
> Components: cli
> Reporter: Armand Grillet
> Assignee: Armand Grillet
> Priority: Major
>
> As a source, we could use the tests in 
> https://github.com/dcos/dcos-core-cli/blob/b930d2004dceb47090004ab658f35cb608bc70e4/python/lib/dcoscli/tests/integrations/test_task.py





[jira] [Commented] (MESOS-9341) Add non-interactive test(s) for `mesos task exec`

2018-10-22 Thread Kevin Klues (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16659027#comment-16659027
 ] 

Kevin Klues commented on MESOS-9341:


{noformat}
commit 4624823deb37df5469e1d1e548945985b59a8a73
Author: Armand Grillet 
Date:   Mon Oct 22 10:13:33 2018 -0400

Added new CLI constants 'TEST_DIRECTORY' and 'TEST_DATA_DIRECTORY'.

Review: https://reviews.apache.org/r/69119/
{noformat}

> Add non-interactive test(s) for `mesos task exec`
> -
>
> Key: MESOS-9341
> URL: https://issues.apache.org/jira/browse/MESOS-9341
> Project: Mesos
> Issue Type: Task
> Components: cli
> Reporter: Armand Grillet
> Assignee: Armand Grillet
> Priority: Major
>
> As a source, we could use the tests in 
> https://github.com/dcos/dcos-core-cli/blob/b930d2004dceb47090004ab658f35cb608bc70e4/python/lib/dcoscli/tests/integrations/test_task.py





[jira] [Comment Edited] (MESOS-9334) Container stuck at ISOLATING state due to libevent poll never returns

2018-10-22 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658795#comment-16658795
 ] 

Qian Zhang edited comment on MESOS-9334 at 10/22/18 2:14 PM:
-

I added some logging to `libevent_poll.cpp` (see the diff below for details) and 
reproduced this issue a couple of times.
{code:java}
--- a/3rdparty/libprocess/src/posix/libevent/libevent_poll.cpp
+++ b/3rdparty/libprocess/src/posix/libevent/libevent_poll.cpp
@@ -32,11 +34,17 @@ struct Poll
 };
 
 
-void pollCallback(evutil_socket_t, short what, void* arg)
+void pollCallback(evutil_socket_t fd, short what, void* arg)
 {
   Poll* poll = reinterpret_cast<Poll*>(arg);
 
+  LOG(INFO) << "==pollCallback starts with fd " << fd
+            << " and with poll " << poll << "==";
+
   if (poll->promise.future().hasDiscard()) {
+    LOG(INFO) << "==pollCallback discards with fd "
+              << fd << "==";
+
     poll->promise.discard();
   } else {
     // Convert libevent specific EV_READ / EV_WRITE to io::* specific
@@ -44,17 +52,24 @@ void pollCallback(evutil_socket_t, short what, void* arg)
     short events =
       ((what & EV_READ) ? io::READ : 0) | ((what & EV_WRITE) ? io::WRITE : 0);
 
+    LOG(INFO) << "==pollCallback sets promise with fd " << fd
+              << " and with events " << events << "==";
+
     poll->promise.set(events);
   }
 
   // Deleting the `poll` also destructs `ev` and hence triggers `event_free`,
   // which makes the event non-pending.
   delete poll;
+
+  LOG(INFO) << "==pollCallback ends with fd " << fd << "==";
 }
 
 
 void pollDiscard(const std::weak_ptr<event>& ev, short events)
 {
+  LOG(INFO) << "==pollDiscard is called==";
+
   // Discarding inside the event loop prevents `pollCallback()` from being
   // called twice if the future is discarded.
   run_in_event_loop([=]() {
@@ -78,6 +93,9 @@ Future<short> poll(int_fd fd, short events)
 
   internal::Poll* poll = new internal::Poll();
 
+  LOG(INFO) << "==libevent starts polling with fd " << fd
+            << " and with poll " << poll << "==";
+
   Future<short> future = poll->promise.future();
 
   // Convert io::READ / io::WRITE to libevent specific values of these
{code}
Here is what I found in the agent log when this issue occurred (fd 48 is the 
stderr file descriptor of `NetworkCniIsolatorSetup`):
{code:java}
I1021 15:57:45.00 2116 libevent_poll.cpp:96] ==libevent starts 
polling with fd 48 and with poll 0x7f60df029eb0==
I1021 15:57:45.00 2117 libevent_poll.cpp:41] ==pollCallback starts 
with fd 48 and with poll 0x7f60e6e56c70==
I1021 15:57:45.00 2117 libevent_poll.cpp:45] ==pollCallback 
discards with fd 48==
I1021 15:57:45.00 2117 libevent_poll.cpp:65] ==pollCallback ends 
with fd 48==
{code}
We can see that libevent started polling fd 48 with the poll object 
0x7f60df029eb0, but when `pollCallback` was called for fd 48, it received a 
different poll object (0x7f60e6e56c70), which had already been discarded (see 
the third log line).

And when I searched 0x7f60e6e56c70 in the agent log, I found:
{code:java}
I1021 15:57:22.00 2115 memory.cpp:478] Started listening for OOM events for 
container 4753a5ef-eccd-4373-b8f0-a4b40abf0fb5
I1021 15:57:22.00 2115 libevent_poll.cpp:96] ==libevent starts 
polling with fd 48 and with poll 0x7f60e6e56c70==
{code}
So the poll object 0x7f60e6e56c70 was created 23 seconds earlier with the same 
file descriptor number (fd 48), which at that point was used to listen for OOM 
events for another container (4753a5ef-eccd-4373-b8f0-a4b40abf0fb5). That 
container was destroyed at 15:57:44, right before the agent started to wait on 
the stderr of `NetworkCniIsolatorSetup` (15:57:45).
{code:java}
I1021 15:57:44.00  2114 containerizer.cpp:2459] Destroying container 
4753a5ef-eccd-4373-b8f0-a4b40abf0fb5 in RUNNING state
{code}
I reproduced this issue a couple of times, and the observations from the agent 
log were always the same as above.
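
One detail behind these logs is that a file descriptor is only a small integer, and POSIX hands out the lowest unused number, so the same value can refer to unrelated files at different times. The sketch below shows just that number reuse in a single-threaded process; it makes no claim about the actual root cause in libevent or libprocess, and all names are illustrative.

```python
import os


def reuse_fd_number() -> tuple:
    """Close a descriptor and show that the next one gets the same number."""
    first_read, first_write = os.pipe()
    old_number = first_read
    os.close(first_read)

    # The lowest available descriptor number is handed out next, so in a
    # single-threaded process the new read end reuses the number just freed.
    second_read, second_write = os.pipe()
    new_number = second_read

    # Clean up every descriptor that is still open.
    for fd in (first_write, second_read, second_write):
        os.close(fd)
    return old_number, new_number


old, new = reuse_fd_number()
print(f"old fd: {old}, new fd: {new}, same number: {old == new}")
```

Any bookkeeping keyed only by the descriptor number therefore cannot tell the old consumer from the new one, which is why the poll object address was logged alongside the fd above.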



[jira] [Commented] (MESOS-9341) Add non-interactive test(s) for `mesos task exec`

2018-10-22 Thread Kevin Klues (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658986#comment-16658986
 ] 

Kevin Klues commented on MESOS-9341:


{noformat}
commit aed4a743daab3c9e5aacb3e14c50d604ba6fd051
Author: Armand Grillet 
Date:   Mon Oct 22 09:33:02 2018 -0400

Added tenacity to 'pip-requirements' for new CLI.

This requirement will be used in upcoming new CLI integration tests.

Review: https://reviews.apache.org/r/69048/
{noformat}

> Add non-interactive test(s) for `mesos task exec`
> -
>
> Key: MESOS-9341
> URL: https://issues.apache.org/jira/browse/MESOS-9341
> Project: Mesos
> Issue Type: Task
> Components: cli
> Reporter: Armand Grillet
> Assignee: Armand Grillet
> Priority: Major
>
> As a source, we could use the tests in 
> https://github.com/dcos/dcos-core-cli/blob/b930d2004dceb47090004ab658f35cb608bc70e4/python/lib/dcoscli/tests/integrations/test_task.py





[jira] [Commented] (MESOS-9341) Add non-interactive test(s) for `mesos task exec`

2018-10-22 Thread Kevin Klues (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658980#comment-16658980
 ] 

Kevin Klues commented on MESOS-9341:


{noformat}
commit 963de3b1811ef569449102192d40ca2cbed73b3c
Author: Armand Grillet 
Date:   Mon Oct 22 09:28:28 2018 -0400

Added 'exec_command' to test util functions for the new CLI.

This code was mostly pulled directly from:
https://github.com/dcos/dcos-core-cli/blob/
7fd55421939a7782c237e2b8719c0fe2f543acd7/
python/lib/dcoscli/dcoscli/test/common.py

This function will be used by tests that do not return a specific output
but an error code, stdout, and stderr. This will be the case for tests
concerning the 'task exec' and 'task attach' subcommands.

Review: https://reviews.apache.org/r/69114/
{noformat}

> Add non-interactive test(s) for `mesos task exec`
> -
>
> Key: MESOS-9341
> URL: https://issues.apache.org/jira/browse/MESOS-9341
> Project: Mesos
> Issue Type: Task
> Components: cli
> Reporter: Armand Grillet
> Assignee: Armand Grillet
> Priority: Major
>
> As a source, we could use the tests in 
> https://github.com/dcos/dcos-core-cli/blob/b930d2004dceb47090004ab658f35cb608bc70e4/python/lib/dcoscli/tests/integrations/test_task.py





[jira] [Assigned] (MESOS-9343) Add test(s) for `mesos task attach` on task launched with a TTY

2018-10-22 Thread Armand Grillet (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet reassigned MESOS-9343:
-

Assignee: Armand Grillet

> Add test(s) for `mesos task attach` on task launched with a TTY 
> 
>
> Key: MESOS-9343
> URL: https://issues.apache.org/jira/browse/MESOS-9343
> Project: Mesos
> Issue Type: Task
> Components: cli
> Reporter: Armand Grillet
> Assignee: Armand Grillet
> Priority: Major
>
> As a source, we could use the tests in 
> https://github.com/dcos/dcos-core-cli/blob/b930d2004dceb47090004ab658f35cb608bc70e4/python/lib/dcoscli/tests/integrations/test_task.py





[jira] [Assigned] (MESOS-9342) Add interactive test(s) for `mesos task exec`

2018-10-22 Thread Armand Grillet (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet reassigned MESOS-9342:
-

Assignee: Armand Grillet

> Add interactive test(s) for `mesos task exec`
> -
>
> Key: MESOS-9342
> URL: https://issues.apache.org/jira/browse/MESOS-9342
> Project: Mesos
> Issue Type: Task
> Components: cli
> Reporter: Armand Grillet
> Assignee: Armand Grillet
> Priority: Major
>
> As a source, we could use the tests in 
> https://github.com/dcos/dcos-core-cli/blob/b930d2004dceb47090004ab658f35cb608bc70e4/python/lib/dcoscli/tests/integrations/test_task.py
> This will require new helper functions to get the input/output of the command.





[jira] [Created] (MESOS-9345) Mesos should gc master log.

2018-10-22 Thread longfei (JIRA)
longfei created MESOS-9345:
--

 Summary: Mesos should gc master log.
 Key: MESOS-9345
 URL: https://issues.apache.org/jira/browse/MESOS-9345
 Project: Mesos
  Issue Type: Improvement
Reporter: longfei
 Attachments: image-2018-10-22-18-56-02-348.png

I have a Mesos cluster, which runs 10m+ short tasks every day.

As a result, the master's logs grow very fast.  

!image-2018-10-22-18-56-02-348.png!

But unlike the agents' logs, the master's logs are never gc'd, so the disk 
fills up fairly quickly.

So I had to write a systemd timer to handle this, which cost coding and 
deployment time.

I suggest that these log files be auto-gc'd, just as the agents' are.
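As a rough illustration of the kind of cleanup such a timer might run, here is a minimal age-based sketch in Python. It is not how Mesos garbage-collects anything; the function name `gc_old_logs`, the directory layout and the age threshold are all assumptions made for the example.

```python
import os
import time


def gc_old_logs(log_dir, max_age_secs, now=None):
    """Delete regular files in log_dir last modified more than max_age_secs ago.

    Hypothetical helper; returns the sorted names of the files it removed.
    """
    now = time.time() if now is None else now
    removed = []
    for entry in os.scandir(log_dir):
        # Skip directories and symlinks so that only plain log files are touched.
        if not entry.is_file(follow_symlinks=False):
            continue
        age = now - entry.stat(follow_symlinks=False).st_mtime
        if age > max_age_secs:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

Passing `now` explicitly keeps the function easy to test; a timer would call it with the master's log directory and a retention period of its choosing.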





[jira] [Assigned] (MESOS-9341) Add non-interactive test(s) for `mesos task exec`

2018-10-22 Thread Armand Grillet (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet reassigned MESOS-9341:
-

Assignee: Armand Grillet

> Add non-interactive test(s) for `mesos task exec`
> -
>
> Key: MESOS-9341
> URL: https://issues.apache.org/jira/browse/MESOS-9341
> Project: Mesos
>  Issue Type: Task
>  Components: cli
>Reporter: Armand Grillet
>Assignee: Armand Grillet
>Priority: Major
>
> As a source, we could use the tests in 
> https://github.com/dcos/dcos-core-cli/blob/b930d2004dceb47090004ab658f35cb608bc70e4/python/lib/dcoscli/tests/integrations/test_task.py





[jira] [Comment Edited] (MESOS-9334) Container stuck at ISOLATING state due to libevent poll never returns

2018-10-22 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658795#comment-16658795
 ] 

Qian Zhang edited comment on MESOS-9334 at 10/22/18 9:55 AM:
-

I added some logging to `libevent_poll.cpp` (see the diff below for details) and 
reproduced this issue a couple of times.
{code:java}
--- a/3rdparty/libprocess/src/posix/libevent/libevent_poll.cpp
+++ b/3rdparty/libprocess/src/posix/libevent/libevent_poll.cpp
@@ -32,11 +34,17 @@ struct Poll
 };
 
 
-void pollCallback(evutil_socket_t, short what, void* arg)
+void pollCallback(evutil_socket_t fd, short what, void* arg)
 {
   Poll* poll = reinterpret_cast<Poll*>(arg);
 
+  LOG(INFO) << "==pollCallback starts with fd " << fd
+<< " and with poll " << poll << "==";
+
   if (poll->promise.future().hasDiscard()) {
+LOG(INFO) << "==pollCallback discards with fd "
+  << fd << "==";
+
 poll->promise.discard();
   } else {
 // Convert libevent specific EV_READ / EV_WRITE to io::* specific
@@ -44,17 +52,24 @@ void pollCallback(evutil_socket_t, short what, void* arg)
 short events =
   ((what & EV_READ) ? io::READ : 0) | ((what & EV_WRITE) ? io::WRITE : 0);
 
+LOG(INFO) << "==pollCallback sets promise with fd " << fd
+  << " and with events " << events << "==";
+
 poll->promise.set(events);
   }
 
   // Deleting the `poll` also destructs `ev` and hence triggers `event_free`,
   // which makes the event non-pending.
   delete poll;
+
+  LOG(INFO) << "==pollCallback ends with fd " << fd << "==";
 }
 
 
 void pollDiscard(const std::weak_ptr<event>& ev, short events)
 {
+  LOG(INFO) << "==pollDiscard is called==";
+
   // Discarding inside the event loop prevents `pollCallback()` from being
   // called twice if the future is discarded.
   run_in_event_loop([=]() {
@@ -78,6 +93,9 @@ Future<short> poll(int_fd fd, short events)
 
   internal::Poll* poll = new internal::Poll();
 
+  LOG(INFO) << "==libevent starts polling with fd " << fd
+<< " and with poll " << poll << "==";
+
   Future<short> future = poll->promise.future();
 
   // Convert io::READ / io::WRITE to libevent specific values of these
{code}
Here is what I found in the agent log when this issue occurred (fd 48 is the 
stderr file descriptor of `NetworkCniIsolatorSetup`):
{code:java}
I1021 15:57:45.00 2116 libevent_poll.cpp:96] ==libevent starts 
polling with fd 48 and with poll 0x7f60df029eb0==
I1021 15:57:45.00 2117 libevent_poll.cpp:41] ==pollCallback starts 
with fd 48 and with poll 0x7f60e6e56c70==
I1021 15:57:45.00 2117 libevent_poll.cpp:45] ==pollCallback 
discards with fd 48==
I1021 15:57:45.00 2117 libevent_poll.cpp:65] ==pollCallback ends 
with fd 48==
{code}
We can see that libevent started to poll fd 48 with the poll object at address 
0x7f60df029eb0, but when `pollCallback` was called for fd 48, it received a 
different poll object (0x7f60e6e56c70), which had been discarded (see the 
third log line).
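The comparison described above can be automated. The following Python sketch scans log records for the two messages added by the diff and reports every `pollCallback` whose poll address differs from the poll most recently started on the same fd. It assumes each record sits on a single line (the records quoted here are wrapped by the mailer), and `find_mismatched_callbacks` is a hypothetical name.

```python
import re

# Patterns for the two log messages added by the diff above.
START = re.compile(
    r"libevent starts polling with fd (\d+) and with poll (0x[0-9a-f]+)")
CALLBACK = re.compile(
    r"pollCallback starts with fd (\d+) and with poll (0x[0-9a-f]+)")


def find_mismatched_callbacks(lines):
    """Return (fd, expected_poll, actual_poll) for each suspicious callback."""
    latest = {}      # fd -> address of the poll most recently started on it
    mismatches = []
    for line in lines:
        match = START.search(line)
        if match:
            latest[int(match.group(1))] = match.group(2)
            continue
        match = CALLBACK.search(line)
        if match:
            fd, actual = int(match.group(1)), match.group(2)
            expected = latest.get(fd)
            if expected is not None and expected != actual:
                mismatches.append((fd, expected, actual))
    return mismatches
```

Run over the records quoted in this comment, it would flag fd 48 with the expected poll 0x7f60df029eb0 and the actual poll 0x7f60e6e56c70.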

And when I searched 0x7f60e6e56c70 in the agent log, I found:
{code:java}
I1021 15:57:22.00 2115 memory.cpp:478] Started listening for OOM events for 
container 4753a5ef-eccd-4373-b8f0-a4b40abf0fb5
I1021 15:57:22.00 2115 libevent_poll.cpp:96] ==libevent starts 
polling with fd 48 and with poll 0x7f60e6e56c70==
{code}
So the poll object 0x7f60e6e56c70 was created 23 seconds earlier with the same 
file descriptor (fd 48), which was used to listen for OOM events for another 
container (4753a5ef-eccd-4373-b8f0-a4b40abf0fb5). That container was destroyed 
at 15:57:44, right before the agent started to wait on the stderr of 
`NetworkCniIsolatorSetup` (15:57:45).
{code:java}
I1021 15:57:44.00  2114 containerizer.cpp:2459] Destroying container 
4753a5ef-eccd-4373-b8f0-a4b40abf0fb5 in RUNNING state
{code}
I reproduced this issue a couple of times, and the observations from the agent 
log were the same as above.



[jira] [Commented] (MESOS-9334) Container stuck at ISOLATING state due to libevent poll never returns

2018-10-22 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658795#comment-16658795
 ] 

Qian Zhang commented on MESOS-9334:
---

I added some logging to `libevent_poll.cpp` (see the diff below for details) and 
reproduced this issue a couple of times.
{code:java}
--- a/3rdparty/libprocess/src/posix/libevent/libevent_poll.cpp
+++ b/3rdparty/libprocess/src/posix/libevent/libevent_poll.cpp
@@ -32,11 +34,17 @@ struct Poll
 };
 
 
-void pollCallback(evutil_socket_t, short what, void* arg)
+void pollCallback(evutil_socket_t fd, short what, void* arg)
 {
   Poll* poll = reinterpret_cast<Poll*>(arg);
 
+  LOG(INFO) << "==pollCallback starts with fd " << fd
+<< " and with poll " << poll << "==";
+
   if (poll->promise.future().hasDiscard()) {
+LOG(INFO) << "==pollCallback discards with fd "
+  << fd << "==";
+
 poll->promise.discard();
   } else {
 // Convert libevent specific EV_READ / EV_WRITE to io::* specific
@@ -44,17 +52,24 @@ void pollCallback(evutil_socket_t, short what, void* arg)
 short events =
   ((what & EV_READ) ? io::READ : 0) | ((what & EV_WRITE) ? io::WRITE : 0);
 
+LOG(INFO) << "==pollCallback sets promise with fd " << fd
+  << " and with events " << events << "==";
+
 poll->promise.set(events);
   }
 
   // Deleting the `poll` also destructs `ev` and hence triggers `event_free`,
   // which makes the event non-pending.
   delete poll;
+
+  LOG(INFO) << "==pollCallback ends with fd " << fd << "==";
 }
 
 
 void pollDiscard(const std::weak_ptr<event>& ev, short events)
 {
+  LOG(INFO) << "==pollDiscard is called==";
+
   // Discarding inside the event loop prevents `pollCallback()` from being
   // called twice if the future is discarded.
   run_in_event_loop([=]() {
@@ -78,6 +93,9 @@ Future<short> poll(int_fd fd, short events)
 
   internal::Poll* poll = new internal::Poll();
 
+  LOG(INFO) << "==libevent starts polling with fd " << fd
+<< " and with poll " << poll << "==";
+
   Future<short> future = poll->promise.future();
 
   // Convert io::READ / io::WRITE to libevent specific values of these
{code}
Here is what I found in the agent log when this issue occurred (fd 48 is the 
stderr file descriptor of `NetworkCniIsolatorSetup`):

 
{code:java}
I1021 15:57:45.00 2116 libevent_poll.cpp:96] ==libevent starts 
polling with fd 48 and with poll 0x7f60df029eb0==
I1021 15:57:45.00 2117 libevent_poll.cpp:41] ==pollCallback starts 
with fd 48 and with poll 0x7f60e6e56c70==
I1021 15:57:45.00 2117 libevent_poll.cpp:45] ==pollCallback 
discards with fd 48==
I1021 15:57:45.00 2117 libevent_poll.cpp:65] ==pollCallback ends 
with fd 48==
{code}
We can see that libevent started to poll fd 48 with the poll object at address 
0x7f60df029eb0, but when `pollCallback` was called for fd 48, it received a 
different poll object (0x7f60e6e56c70), which had been discarded (see the 
third log line).

And when I searched 0x7f60e6e56c70 in the agent log, I found:

 
{code:java}
I1021 15:57:22.00 2115 memory.cpp:478] Started listening for OOM events for 
container 4753a5ef-eccd-4373-b8f0-a4b40abf0fb5
I1021 15:57:22.00 2115 libevent_poll.cpp:96] ==libevent starts 
polling with fd 48 and with poll 0x7f60e6e56c70==
{code}
So the poll object 0x7f60e6e56c70 was created 23 seconds earlier with the same 
file descriptor (fd 48), which was used to listen for OOM events for another 
container (4753a5ef-eccd-4373-b8f0-a4b40abf0fb5). That container was destroyed 
at 15:57:44, right before the agent started to wait on the stderr of 
`NetworkCniIsolatorSetup` (15:57:45).

 
{code:java}
I1021 15:57:44.00  2114 containerizer.cpp:2459] Destroying container 
4753a5ef-eccd-4373-b8f0-a4b40abf0fb5 in RUNNING state
{code}
I reproduced this issue a couple of times, and the observations from the agent 
log were the same as above.
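That two unrelated users ended up with the same descriptor number is expected on POSIX systems: `open()` and similar calls return the lowest-numbered descriptor that is currently unused, so a number released by `close()` is typically handed out again right away. The Python sketch below only demonstrates that numbering behavior in a single-threaded process; it does not reproduce the libprocess or libevent code path, and `reuse_fd_number` is a hypothetical name.

```python
import os


def reuse_fd_number():
    """Open, close, and reopen a file; return (old_fd, new_fd)."""
    old_fd = os.open(os.devnull, os.O_RDONLY)
    os.close(old_fd)
    # With no other thread opening files in between, the lowest unused
    # descriptor number is the one that was just released.
    new_fd = os.open(os.devnull, os.O_RDONLY)
    os.close(new_fd)
    return old_fd, new_fd


if __name__ == "__main__":
    old_fd, new_fd = reuse_fd_number()
    print("first open:", old_fd, "second open:", new_fd)
```

Any bookkeeping keyed only by the descriptor number therefore cannot tell the old user of that number from the new one.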

 

> Container stuck at ISOLATING state due to libevent poll never returns
> -
>
> Key: MESOS-9334
> URL: https://issues.apache.org/jira/browse/MESOS-9334
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Qian Zhang
>Priority: Critical
>
> We found that a UCR container may get stuck in the `ISOLATING` state:
> {code:java}
> 2018-10-03 09:13:23: I1003 09:13:23.274561 2355 containerizer.cpp:3122] 
> Transitioning the state of container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 
> from PREPARING to ISOLATING
> 2018-10-03 09:13:23: I1003 09:13:23.279223 2354 cni.cpp:962] Bind mounted 
> '/proc/5244/ns/net' to 
> '/run/mesos/isolators/network/cni/1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54/ns' 
> for container 

[jira] [Commented] (MESOS-8780) Expose Check and HealthCheck information on Mesos HTTP endpoints.

2018-10-22 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658741#comment-16658741
 ] 

Alexander Rukletsov commented on MESOS-8780:


Let's keep this one open: it's good to have checks and health checks as much in 
sync as possible.

> Expose Check and HealthCheck information on Mesos HTTP endpoints.
> -
>
> Key: MESOS-8780
> URL: https://issues.apache.org/jira/browse/MESOS-8780
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Adam Medziński
>Assignee: Greg Mann
>Priority: Minor
>  Labels: api, integration, mesosphere
>
> Is there a specific reason why task health check definitions are not exposed 
> on Mesos HTTP endpoints ({{/master/tasks}} or {{/slave/state}})? I'm working 
> on an integration with Hashicorp Consul, and exposing them would allow me to 
> synchronize the health check definitions using only the HTTP API. If the 
> omission is accidental, I will gladly make a pull request.
> This applies to both {{HealthCheck}} and {{CheckInfo}} in both the {{v0}} and 
> {{v1}} APIs.


