[jira] [Commented] (MESOS-9334) Container stuck at ISOLATING state due to libevent poll never returns

2018-10-25 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664603#comment-16664603
 ] 

Qian Zhang commented on MESOS-9334:
---

RR: https://reviews.apache.org/r/69123/

> Container stuck at ISOLATING state due to libevent poll never returns
> -
>
> Key: MESOS-9334
> URL: https://issues.apache.org/jira/browse/MESOS-9334
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Critical
>
> We found that a UCR container may get stuck in the `ISOLATING` state:
> {code:java}
> 2018-10-03 09:13:23: I1003 09:13:23.274561 2355 containerizer.cpp:3122] 
> Transitioning the state of container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 
> from PREPARING to ISOLATING
> 2018-10-03 09:13:23: I1003 09:13:23.279223 2354 cni.cpp:962] Bind mounted 
> '/proc/5244/ns/net' to 
> '/run/mesos/isolators/network/cni/1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54/ns' 
> for container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54
> 2018-10-03 09:23:22: I1003 09:23:22.879868 2354 containerizer.cpp:2459] 
> Destroying container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 in ISOLATING state
> {code}
>  In the above logs, the state of container 
> `1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54` transitioned to `ISOLATING` at 
> 09:13:23 but did not transition to any other state until the container was 
> destroyed due to the executor registration timeout (10 mins). And the 
> destroy can never complete, since it must wait for the container to finish 
> isolating.
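
To make the hang concrete, here is a minimal sketch (plain C++ with 
std::future, not actual Mesos code) of the dependency described above: the 
destroy path waits on the isolation future, so if the isolator's poll never 
returns, that future is never satisfied and the destroy blocks indefinitely.

{code:cpp}
// Sketch only: models the ISOLATING -> destroy dependency with std::future.
// In the real bug the libevent-backed poll never fires, so the promise
// below is never set; a timeout is used here to demonstrate the stall.
#include <chrono>
#include <future>
#include <iostream>

int main() {
  std::promise<void> isolated;                // set when isolation completes
  std::future<void> done = isolated.get_future();

  // destroy() must wait for isolation to finish before tearing down the
  // container; with the poll stuck, this wait would block forever.
  if (done.wait_for(std::chrono::seconds(1)) != std::future_status::ready) {
    std::cout << "destroy blocked: container still ISOLATING" << std::endl;
  }
  return 0;
}
{code}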



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7564) Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.

2018-10-25 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664504#comment-16664504
 ] 

Greg Mann commented on MESOS-7564:
--

I think that, luckily, the backward-compatibility story of this change should 
be fairly simple; executors can opt in to heartbeats in their SUBSCRIBE call, 
either with a special field or by specifying the correct Content-Type.

We've had some users with CNI deployments that close executor <=> agent 
connections after extended periods of inactivity, so this fix will be 
prioritized soon.

> Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.
> -
>
> Key: MESOS-7564
> URL: https://issues.apache.org/jira/browse/MESOS-7564
> Project: Mesos
>  Issue Type: Bug
>Reporter: Anand Mazumdar
>Priority: Critical
>  Labels: mesosphere
>
> Currently, we do not have heartbeats for executor <-> agent communication. 
> This is especially problematic in scenarios where IPFilters are enabled, 
> since the default conntrack keep-alive timeout is 5 days. When that timeout 
> elapses, the executor doesn't get notified via a socket disconnection when 
> the agent process restarts. The executor would then get killed if it doesn't 
> re-register when the agent recovery process completes.
> Enabling application-level heartbeats or TCP keepalives is a possible way to 
> fix this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-7974) Accept "application/recordio" type is rejected for master operator API SUBSCRIBE call

2018-10-25 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664483#comment-16664483
 ] 

Joseph Wu edited comment on MESOS-7974 at 10/26/18 12:58 AM:
-

The relevant review in the chain of MESOS-9258 that will fix this:
https://reviews.apache.org/r/69185/

(Note: Without this change in the chain, the master actor (after applying other 
patches) will crash upon receiving the streaming headers :D )


was (Author: kaysoky):
The relevant review in the chain of MESOS-9258 that will fix this:
https://reviews.apache.org/r/69185/

(Note: Without this change, the master actor will crash upon receiving the 
streaming headers :D )

> Accept "application/recordio" type is rejected for master operator API 
> SUBSCRIBE call
> -
>
> Key: MESOS-7974
> URL: https://issues.apache.org/jira/browse/MESOS-7974
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.1
>Reporter: James DeFelice
>Assignee: Joseph Wu
>Priority: Major
>  Labels: mesosphere
>
> The agent operator API supports "application/recordio" for things like 
> attach-container-output, which streams objects back to the caller. I expected 
> the master operator API SUBSCRIBE call to work the same way, w/ 
> Accept/Content-Type headers for "recordio" and 
> Message-Accept/Message-Content-Type headers for json (or protobuf). This was 
> not the case.
> Looking again at the master operator API documentation, the SUBSCRIBE docs 
> illustrate usage of Accept and Content-Type headers for the 
> "application/json" type, not a "recordio" type. So my experience, as per the 
> docs, seems expected. However, this is counter-intuitive, since the whole 
> point of adding 
> the new Message-prefixed headers was to help callers consistently request 
> (and differentiate) streaming responses from non-streaming responses in the 
> v1 API.
> Please fix the master operator API implementation to also support the 
> Message-prefixed headers w/ Accept/Content-Type set to "recordio".
> Observed on ubuntu w/ mesos package version 1.2.1-2.0.1
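
For reference, a request of the kind this ticket asks the master to accept 
might look as follows (a sketch based on the Message-prefixed header scheme 
described above; the exact headers the final fix accepts may differ):

{noformat}
POST /api/v1 HTTP/1.1
Content-Type: application/json
Accept: application/recordio
Message-Accept: application/json

{"type": "SUBSCRIBE"}
{noformat}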



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7974) Accept "application/recordio" type is rejected for master operator API SUBSCRIBE call

2018-10-25 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664483#comment-16664483
 ] 

Joseph Wu commented on MESOS-7974:
--

The relevant review in the chain of MESOS-9258 that will fix this:
https://reviews.apache.org/r/69185/

(Note: Without this change, the master actor will crash upon receiving the 
streaming headers :D )

> Accept "application/recordio" type is rejected for master operator API 
> SUBSCRIBE call
> -
>
> Key: MESOS-7974
> URL: https://issues.apache.org/jira/browse/MESOS-7974
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.1
>Reporter: James DeFelice
>Assignee: Joseph Wu
>Priority: Major
>  Labels: mesosphere
>
> The agent operator API supports "application/recordio" for things like 
> attach-container-output, which streams objects back to the caller. I expected 
> the master operator API SUBSCRIBE call to work the same way, w/ 
> Accept/Content-Type headers for "recordio" and 
> Message-Accept/Message-Content-Type headers for json (or protobuf). This was 
> not the case.
> Looking again at the master operator API documentation, the SUBSCRIBE docs 
> illustrate usage of Accept and Content-Type headers for the 
> "application/json" type, not a "recordio" type. So my experience, as per the 
> docs, seems expected. However, this is counter-intuitive, since the whole 
> point of adding 
> the new Message-prefixed headers was to help callers consistently request 
> (and differentiate) streaming responses from non-streaming responses in the 
> v1 API.
> Please fix the master operator API implementation to also support the 
> Message-prefixed headers w/ Accept/Content-Type set to "recordio".
> Observed on ubuntu w/ mesos package version 1.2.1-2.0.1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9258) Consider making Mesos subscribers send heartbeats

2018-10-25 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664480#comment-16664480
 ] 

Joseph Wu commented on MESOS-9258:
--

Still in progress, but a prototype is up for preliminary review starting here: 
https://reviews.apache.org/r/69180/

The idea is to let the {{master /api/v1 SUBSCRIBE}} call take an (optional) 
streaming request as well as a streaming response.  When the call is made via 
a streaming request, the same stream will be used to send heartbeats from the 
client to the master.
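
As an illustration of the streaming-request variant (a sketch only: the 
HEARTBEAT call name and the record framing below are assumptions based on the 
description above, not the final API), the client would keep the request 
stream open and periodically append heartbeat records, each prefixed with its 
length per the recordio format:

{noformat}
POST /api/v1 HTTP/1.1
Content-Type: application/recordio
Message-Content-Type: application/json
Accept: application/recordio
Message-Accept: application/json

20
{"type":"SUBSCRIBE"}
20
{"type":"HEARTBEAT"}
(the HEARTBEAT record is repeated periodically on the same connection)
{noformat}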

> Consider making Mesos subscribers send heartbeats
> -
>
> Key: MESOS-9258
> URL: https://issues.apache.org/jira/browse/MESOS-9258
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Gastón Kleiman
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: mesosphere
>
> Some reverse proxies (e.g., ELB using an HTTP listener) won't close the 
> upstream connection to Mesos when they detect that their client is 
> disconnected.
> This can make Mesos leak subscribers, which generates unnecessary 
> authorization requests and affects performance.
> We should evaluate methods (e.g., heartbeats) to enable Mesos to detect that 
> a subscriber is gone, even if the TCP connection is still open.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9079) Test MasterTestPrePostReservationRefinement.LaunchGroup is flaky.

2018-10-25 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-9079:
---

Assignee: Meng Zhu

> Test MasterTestPrePostReservationRefinement.LaunchGroup is flaky.
> -
>
> Key: MESOS-9079
> URL: https://issues.apache.org/jira/browse/MESOS-9079
> Project: Mesos
>  Issue Type: Bug
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: flaky-test
> Attachments: 
> MasterTestPrePostReservationRefinement_LaunchGroup_0_badrun.txt
>
>
> Flaky on CI mac-SSL.Mesos
> Error Message
> {noformat}
> ../../src/tests/master_tests.cpp:9270
> Failed to wait 15secs for runningUpdate
> {noformat}
> Log attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9356) Make agent atomically checkpoint operations and resources

2018-10-25 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-9356:
-

Assignee: Gastón Kleiman

> Make agent atomically checkpoint operations and resources
> -
>
> Key: MESOS-9356
> URL: https://issues.apache.org/jira/browse/MESOS-9356
> Project: Mesos
>  Issue Type: Task
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>Priority: Major
>  Labels: agent, mesosphere, operation-feedback
>
> See 
> https://docs.google.com/document/d/1HxMBCfzU9OZ-5CxmPG3TG9FJjZ_-xDUteLz64GhnBl0/edit
>  for more details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9357) FetcherTest.DuplicateFileURI fails on macos

2018-10-25 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-9357:
---

 Summary: FetcherTest.DuplicateFileURI fails on macos
 Key: MESOS-9357
 URL: https://issues.apache.org/jira/browse/MESOS-9357
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Benjamin Bannier


I see {{FetcherTest.DuplicateFileURI}} fail pretty reliably on macOS, e.g., 
10.14.
{noformat}
../../src/tests/fetcher_tests.cpp:173
Value of: os::exists("two")
  Actual: false
Expected: true
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7434) SlaveTest.RestartSlaveRequireExecutorAuthentication is flaky.

2018-10-25 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664299#comment-16664299
 ] 

Meng Zhu commented on MESOS-7434:
-

Observed this again today on macOS; the failure is still related to the `cat` 
command, which leads to premature container termination. I will cross-post 
[~kaysoky]'s comment from Greg's patch above for visibility:

bq. From the logs I've seen, the cat command seems to be exiting due to a pipe 
closure.
bq. 
bq. In the past, commands like this would be launched sharing the stdin of the 
agent process (which in tests, is equal to the test process).  But after the 
introduction of the IO switchboard, there are more layers to consider:
bq. 
bq. 1) If the container is launched with a tty_info (not the case in this 
test), the stdin will come from a TTY.
bq. 2) In local mode, the stdin is shared with the parent process.
bq. 3) In normal mode (this test), the stdin will be a pipe to the IO 
switchboard server process.
bq. 
bq. Perhaps, when the agent gets restarted in the test, it ends up killing the 
IO switchboard server somehow?  The agent restart is a semi-graceful shutdown, 
meaning it may call destructors.  In an actual agent restart, there may not be 
time to call destructors.
bq. 
bq. So TL;DR: Investigate if the IO Switchboard server is dying in some test 
runs.

> SlaveTest.RestartSlaveRequireExecutorAuthentication is flaky.
> -
>
> Key: MESOS-7434
> URL: https://issues.apache.org/jira/browse/MESOS-7434
> Project: Mesos
>  Issue Type: Bug
> Environment: Debian 8
> CentOS 6
> other Linux distros
>Reporter: Greg Mann
>Priority: Major
>  Labels: flaky, flaky-test, mesosphere
> Attachments: RestartSlaveRequireExecutorAuthentication is 
> flaky_failure_log_centos6.txt, 
> RestartSlaveRequireExecutorAuthentication_failure_log_debian8.txt, 
> SlaveTest.RestartSlaveRequireExecAuth-Ubuntu-16.txt
>
>
> This test failure has been observed on an internal CI system. It occurs on a 
> variety of Linux distributions. It seems that using {{cat}} as the task 
> command may be problematic; see attached log file 
> {{SlaveTest.RestartSlaveRequireExecutorAuthentication.txt}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9320) UCR container launch stuck at PROVISIONING during image fetching.

2018-10-25 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-9320:
---

Assignee: Gilbert Song

> UCR container launch stuck at PROVISIONING during image fetching.
> -
>
> Key: MESOS-9320
> URL: https://issues.apache.org/jira/browse/MESOS-9320
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Major
>  Labels: containerizer
>
> We observed the Mesos containerizer stuck at PROVISIONING when launching a 
> Mesos container using the Docker image 
> `kvish/jenkins-dev:595c74f713f609fd1d3b05a40d35113fc03227c9`:
> The image pull never finishes, and incomplete image contents remain in the 
> image store staging directory 
> /var/lib/mesos/slave/store/docker/staging/egLYqO forever.
> {noformat}
> OK-22:50:06-root@int-agent89-mwst9:/var/lib/mesos/slave/store/docker/staging/egLYqO
>  # ls -alh
> total 1.1G
> drwx--. 2 root root 4.0K Oct 15 13:02 .
> drwxr-xr-x. 3 root root   20 Oct 15 22:40 ..
> -rw-r--r--. 1 root root  59K Oct 15 13:02 manifest
> -rw-r--r--. 1 root root 2.6K Oct 15 13:02 
> sha256:08239cb71d7a3e0d8ed680397590b338a2133117250e1a3e2ee5c5c45292db63
> -rw-r--r--. 1 root root  440 Oct 15 13:02 
> sha256:0984904c0e1558248eb25e93d9fc14c47c0052d58569e64c185afca93a060b66
> -rw-r--r--. 1 root root  248 Oct 15 13:02 
> sha256:0bbc7b377a9155696eb0b684bd1999bc43937918552d73fd9697ea50ef46528a
> -rw-r--r--. 1 root root  240 Oct 15 13:02 
> sha256:0c5c0c095e351b976943453c80271f3b75b1208dbad3ca7845332e873361f3bb
> -rw-r--r--. 1 root root  562 Oct 15 13:02 
> sha256:1558b7c35c9e25577ee719529d6fcdddebea68f5bdf8cbdf13d8d75a02f8a5b1
> -rw-r--r--. 1 root root  11M Oct 15 13:02 
> sha256:1ab373b3deaed929a15574ac1912afc6e173f80d400aba0e96c89f6a58961f2d
> -rw-r--r--. 1 root root  130 Oct 15 13:02 
> sha256:1b6c70b3786f72e5255ccd51e27840d1c853a17561b5e94a4359b17d27494d50
> -rw-r--r--. 1 root root  176 Oct 15 13:02 
> sha256:1bf4aab5c3b363b4fdfc46026df9ae854db8858a5cbcccdd4409434817d59312
> -rw-r--r--. 1 root root  380 Oct 15 13:02 
> sha256:213b0c5bb5300df1d2d06df6213ae94448419cf18ecf61358e978a5d25651d5a
> -rw-r--r--. 1 root root  71M Oct 15 13:02 
> sha256:31aaab384e3fa66b73eced4870fc96be590a2376e93fd4f8db5d00f94fb11604
> -rw-r--r--. 1 root root 1.4K Oct 15 13:02 
> sha256:32442b7d159ed2b7f00b00a989ca1d3ee1a3f566df5d5acbd25f0c3dfdad69d1
> -rw-r--r--. 1 root root 653K Oct 15 13:02 
> sha256:340cd692075b636b5e1803fcde9b1a56a2f6e2728e4fb10f7295d39c7d0e0d01
> -rw-r--r--. 1 root root  184 Oct 15 13:02 
> sha256:398819b00c6cbf9cce6c1ed25005c9e1242cace7a6436730e17da052000c7f90
> -rw-r--r--. 1 root root 366K Oct 15 13:02 
> sha256:41d78c0cb1b2a47189068e55f61d6266be14c4fa75935cb021f17668dd8e7f94
> -rw-r--r--. 1 root root  23K Oct 15 13:02 
> sha256:4f5852c22c7ce0155494b6e86a0a4c536c3c95cb87cad84806aa2d56184b95d2
> -rw-r--r--. 1 root root 384M Oct 15 13:02 
> sha256:4fe621515c4d23e33d9850a6cdfc3aa686d790704b9c5569f1726b4469aa30c0
> -rw-r--r--. 1 root root 1.5K Oct 15 13:02 
> sha256:50dcd1d0618b1d42bf6633dc8176e164571081494fa6483ec4489a59637518bc
> -rw-r--r--. 1 root root  48M Oct 15 13:02 
> sha256:57c8de432dbe337bb6cb1ad328e6c564303a3d3fd05b5e872fd9c47c16fdd02c
> -rw-r--r--. 1 root root  30M Oct 15 13:02 
> sha256:63a0f0b6b5d7014b647ac4a164808208229d2e3219f45a39914f0561a4f831bf
> -rw-r--r--. 1 root root 306M Oct 15 13:02 
> sha256:67f41ed73c082c6ffee553a90b0abd56bc74b260d90b9d594d652b66cbcd5e7f
> -rw-r--r--. 1 root root  435 Oct 15 13:02 
> sha256:6cb303e084ed78386ae87cdaf95e8817d48e94b3ce7c0442a28335600f0efa3d
> -rw-r--r--. 1 root root 5.5K Oct 15 13:02 
> sha256:7d4d905c2060a5ec994ec201e6877714ee73030ef4261f9562abdb0f844174d5
> -rw-r--r--. 1 root root  39M Oct 15 13:02 
> sha256:80d923f4b955c2db89e2e8a9f2dcb0c36a29c1520a5b359578ce2f3d0b849d10
> -rw-r--r--. 1 root root  615 Oct 15 13:02 
> sha256:842cc8bd099d94f6f9c082785bbaa35439af965d1cf6a13300830561427c266b
> -rw-r--r--. 1 root root  712 Oct 15 13:02 
> sha256:977c8e6687e0ca5f0682915102c025dc12d7ff71bf70de17aab3502adda25af2
> -rw-r--r--. 1 root root  12K Oct 15 13:02 
> sha256:989ac24c53a1f7951438aa92ac39bc9053c178336bea4ebe6ab733d4975c9728
> -rw-r--r--. 1 root root  861 Oct 15 13:02 
> sha256:a18e3c45bf91ac3bd11a46b489fb647a721417f60eae66c5f605360ccd8d6352
> -rw-r--r--. 1 root root   32 Oct 15 13:02 
> sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
> -rw-r--r--. 1 root root 266K Oct 15 13:02 
> sha256:b1d3e8de8ec6d87b8485a8a3b66d63125a033cfb0711f8af24b4f600f524e276
> -rw-r--r--. 1 root root 1.6K Oct 15 13:02 
> sha256:b3a122ff7868d2ed9c063df73b0bf67fd77348d3baa2a92368b3479b41f8aa74
> -rw-r--r--. 1 root root 4.2M Oct 15 13:02 
> 

[jira] [Created] (MESOS-9356) Make agent atomically checkpoint operations and resources

2018-10-25 Thread JIRA
Gastón Kleiman created MESOS-9356:
-

 Summary: Make agent atomically checkpoint operations and resources
 Key: MESOS-9356
 URL: https://issues.apache.org/jira/browse/MESOS-9356
 Project: Mesos
  Issue Type: Task
Reporter: Gastón Kleiman


See 
https://docs.google.com/document/d/1HxMBCfzU9OZ-5CxmPG3TG9FJjZ_-xDUteLz64GhnBl0/edit
 for more details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-6240) Allow executor/agent communication over non-TCP/IP stream socket.

2018-10-25 Thread Greg Mann (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-6240:


Assignee: (was: Benjamin Hindman)

> Allow executor/agent communication over non-TCP/IP stream socket.
> -
>
> Key: MESOS-6240
> URL: https://issues.apache.org/jira/browse/MESOS-6240
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, executor
> Environment: Linux and Windows
>Reporter: Avinash Sridharan
>Priority: Major
>  Labels: mesosphere
>
> Currently, executor-agent communication happens exclusively over TCP 
> sockets. This works fine in most cases, but for the `MesosContainerizer`, 
> when containers are running on CNI networks, this mode of communication 
> imposes constraints on the CNI network, since there now has to be 
> connectivity between the CNI network (on which the executor is running) and 
> the agent. Introducing paths from a CNI network to the underlying agent at 
> best creates headaches for operators and at worst introduces serious 
> security holes in the network, since it breaks the isolation between the 
> container CNI network and the host network (on which the agent is running).
> In order to simplify and strengthen deployment of Mesos containers on CNI 
> networks, we therefore need to move away from using TCP/IP sockets for 
> executor/agent communication. Since the executor and agent are guaranteed to 
> run on the same host, the above problems can be resolved if, for the 
> `MesosContainerizer`, we use UNIX domain sockets or named pipes instead of 
> TCP/IP sockets for the executor/agent communication.
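
A minimal sketch of the agent-side listener this proposal implies (plain 
POSIX C++, not Mesos code; the socket path is hypothetical):

{code:cpp}
// Sketch: a UNIX domain stream socket that an executor on the same host
// can connect to without requiring any IP connectivity to the agent.
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#include <cstdio>
#include <cstring>

int main() {
  int fd = socket(AF_UNIX, SOCK_STREAM, 0);
  if (fd == -1) { perror("socket"); return 1; }

  sockaddr_un addr{};
  addr.sun_family = AF_UNIX;
  // Hypothetical path; the real layout would be decided by the agent.
  strncpy(addr.sun_path, "/run/mesos/executor.sock", sizeof(addr.sun_path) - 1);
  unlink(addr.sun_path);

  if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == -1 ||
      listen(fd, 16) == -1) {
    perror("bind/listen");
    close(fd);
    return 1;
  }

  printf("agent listening on %s\n", addr.sun_path);
  close(fd);
  return 0;
}
{code}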



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6417) Introduce an extra 'unknown' health check state.

2018-10-25 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663956#comment-16663956
 ] 

Greg Mann commented on MESOS-6417:
--

I recently sent out a preliminary API design for this update; [~alexr] and I 
will do some investigation to ensure that this plan fits with what we envision 
for future updates to the health check protos, write a short document, and post 
it here.

> Introduce an extra 'unknown' health check state.
> 
>
> Key: MESOS-6417
> URL: https://issues.apache.org/jira/browse/MESOS-6417
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Alexander Rukletsov
>Assignee: Greg Mann
>Priority: Major
>  Labels: health-check, mesosphere
>
> There are three logical states regarding health checks:
> 1) no health checks;
> 2) a health check is defined, but no result is available yet;
> 3) a health check is defined, it is either healthy or not.
> Currently, we do not distinguish between 1) and 2), which can be problematic 
> for framework authors.
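
One way to picture the distinction (an illustrative C++ enum only, not the 
actual proto change being designed):

{code:cpp}
// The three logical situations listed above, with state 3 split into
// its two possible outcomes.
enum class HealthState {
  NONE,       // 1) no health check defined
  UNKNOWN,    // 2) health check defined, but no result available yet
  HEALTHY,    // 3) health check defined and currently passing
  UNHEALTHY   // 3) health check defined and currently failing
};
{code}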



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6417) Introduce an extra 'unknown' health check state.

2018-10-25 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663955#comment-16663955
 ] 

Greg Mann commented on MESOS-6417:
--

[~urbanserj] Unfortunately, I think that solution wouldn't work, since it's a 
breaking API change. There could be frameworks which react to the first 
'false' health check detected, and if we alter the time at which we populate 
that field, it could break such deployments.

> Introduce an extra 'unknown' health check state.
> 
>
> Key: MESOS-6417
> URL: https://issues.apache.org/jira/browse/MESOS-6417
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Alexander Rukletsov
>Assignee: Greg Mann
>Priority: Major
>  Labels: health-check, mesosphere
>
> There are three logical states regarding health checks:
> 1) no health checks;
> 2) a health check is defined, but no result is available yet;
> 3) a health check is defined, it is either healthy or not.
> Currently, we do not distinguish between 1) and 2), which can be problematic 
> for framework authors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9317) Some master endpoints do not handle failed authorization properly.

2018-10-25 Thread JIRA


[ 
https://issues.apache.org/jira/browse/MESOS-9317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663884#comment-16663884
 ] 

Gastón Kleiman commented on MESOS-9317:
---

cc/ [~arojas] are you aware of this one?

> Some master endpoints do not handle failed authorization properly.
> --
>
> Key: MESOS-9317
> URL: https://issues.apache.org/jira/browse/MESOS-9317
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.5.1, 1.6.1, 1.7.0
>Reporter: Alexander Rukletsov
>Priority: Blocker
>  Labels: authorization, integration, mesosphere, reliability, 
> security
>
> When we authorize _some_ actions (right now I see this happening for create / 
> destroy volumes and reserve / unreserve resources) *and* the {{authorizer}} 
> fails (i.e., returns the future in a non-ready state), an assertion is 
> triggered:
> {noformat}
> mesos-master[49173]: F1015 11:40:29.795748 49396 future.hpp:1306] Check 
> failed: !isFailed() Future::get() but state == FAILED: Failed to retrieve 
> permissions from IAM at url 
> https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions the 
> request failed: Failed to contact bouncer at 
> https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to 
> time out after 3 attempts
> {noformat}
> This is due to an incorrect assumption in our code; see for example 
> [https://github.com/apache/mesos/blob/a063afce9868dcee38a0ab7efaa028244f3999cf/src/master/master.cpp#L3752-L3763]:
> {noformat}
>   return await(authorizations)
>     .then([](const vector<Future<bool>>& authorizations)
>         -> Future<bool> {
>       // Compute a disjunction.
>       foreach (const Future<bool>& authorization, authorizations) {
>         if (!authorization.get()) {
>           return false;
>         }
>       }
>       return true;
>     });
> {noformat}
> Futures returned from {{await}} are guaranteed to be in a terminal state, but 
> not necessarily ready! In the snippet above, {{!authorization.get()}} is 
> invoked without the future being checked ⇒ the assertion fails.
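
A sketch of the kind of fix this implies: check each future's readiness 
before dereferencing it (the {{Failure}} message is illustrative, not the 
actual patch):

{code:cpp}
  return await(authorizations)
    .then([](const vector<Future<bool>>& authorizations)
        -> Future<bool> {
      // `await` only guarantees terminal states (READY, FAILED, or
      // DISCARDED), so check each future before calling `.get()`.
      foreach (const Future<bool>& authorization, authorizations) {
        if (!authorization.isReady()) {
          return Failure("Failed to authorize the action");
        }
        if (!authorization.get()) {
          return false;
        }
      }
      return true;
    });
{code}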
> Full stack trace:
> {noformat}
> Oct 15 11:40:39 int-master2-mwst9.scaletesting.mesosphe.re 
> mesos-master[49173]: F1015 11:40:29.795748 49396 future.hpp:1306] Check 
> failed: !isFailed() Future::get() but state == FAILED: Failed to retrieve 
> permissions from IAM at url 
> https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions the 
> request failed: Failed to contact bouncer at 
> https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to 
> time out after 3 attempts
> F1015 11:40:29.796037 49395 future.hpp:1306] Check failed: !isFailed() 
> Future::get() but state == FAILED: Failed to retrieve permissions from IAM 
> at url https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions 
> the request failed: Failed to contact bouncer at 
> https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to 
> time out after 3 attempts
> F1015 11:40:29.796097 49384 future.hpp:1306] Check failed: !isFailed() 
> Future::get() but state == FAILED: Failed to retrieve permissions from IAM 
> at url https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions 
> the request failed: Failed to contact bouncer at 
> https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to 
> time out after 3 attempts
> F1015 11:40:29.796249 49393 future.hpp:1306] Check failed: !isFailed() 
> Future::get() but state == FAILED: Failed to retrieve permissions from IAM 
> at url https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions 
> the request failed: Failed to contact bouncer at 
> https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to 
> time out after 3 attempts
> F1015 11:40:29.796375 49390 future.hpp:1306] Check failed: !isFailed() 
> Future::get() but state == FAILED: Failed to retrieve permissions from IAM 
> at url https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions 
> the request failed: Failed to contact bouncer at 
> https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to 
> time out after 3 attempts
> F1015 11:40:29.796483 49388 future.hpp:1306] Check failed: !isFailed() 
> Future::get() but state == FAILED: Failed to retrieve permissions from IAM 
> at url https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions 
> the request failed: Failed to contact bouncer at 
> https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions due to 
> time out after 3 attempts
> F1015 11:40:29.796629 49381 future.hpp:1306] Check failed: !isFailed() 
> Future::get() but state == FAILED: Failed to retrieve permissions from IAM 
> at url https://localhost:443/acs/api/v1/users/marathon_user_ee/permissions 
> the request failed: Failed to contact bouncer at 
> 

[jira] [Comment Edited] (MESOS-8403) Add agent HTTP API operator call to mark local resource providers as gone

2018-10-25 Thread Benjamin Bannier (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565525#comment-16565525
 ] 

Benjamin Bannier edited comment on MESOS-8403 at 10/25/18 10:47 AM:


Reviews:

-[https://reviews.apache.org/r/68143/]-
 -[https://reviews.apache.org/r/68144/]-
 -[https://reviews.apache.org/r/68146/]-
 [https://reviews.apache.org/r/68147/]
https://reviews.apache.org/r/69158/


was (Author: bbannier):
Reviews:

-[https://reviews.apache.org/r/68143/]-
 -[https://reviews.apache.org/r/68144/]-
 -[https://reviews.apache.org/r/68146/]-
 [https://reviews.apache.org/r/68147/]

> Add agent HTTP API operator call to mark local resource providers as gone
> -
>
> Key: MESOS-8403
> URL: https://issues.apache.org/jira/browse/MESOS-8403
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, storage
>Affects Versions: 1.5.0
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>Priority: Major
>  Labels: mesosphere
>
> It is currently not possible to mark local resource providers as gone (e.g., 
> after agent reconfiguration). As resource providers registered at earlier 
> times could still be cached in a number of places (e.g., the agent or the 
> master), the only way to, for example, keep these caches from growing too 
> large is to fail over the caching components (e.g., to prevent an agent 
> cache from updating a fresh master cache during reconciliation).
> Showing unavailable and known-to-be-gone resource providers in various 
> endpoints is likely also confusing to users.
> We should add an operator call to mark resource providers as gone. While the 
> entity managing resource provider subscription state is the resource provider 
> manager, it still seems to make sense to add this operator call to the agent 
> API, as currently only local resource providers are supported. The agent 
> would then forward the call to the resource provider manager, which would 
> transition its state for the affected resource provider (e.g., setting its 
> state to {{GONE}} and removing it from the list of known resource providers) 
> and then send out an update to its subscribers.
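
On the wire, such an agent API call could look roughly like this (the call 
name and field names are hypothetical; the ticket predates an agreed-upon 
schema):

{noformat}
POST /api/v1 HTTP/1.1
Content-Type: application/json

{
  "type": "MARK_RESOURCE_PROVIDER_GONE",
  "mark_resource_provider_gone": {
    "resource_provider_id": {"value": "<resource-provider-id>"}
  }
}
{noformat}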



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9164) Subprocess should unset CLOEXEC on whitelisted file descriptors.

2018-10-25 Thread Qian Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-9164:
-

Assignee: Qian Zhang  (was: James Peach)

> Subprocess should unset CLOEXEC on whitelisted file descriptors.
> 
>
> Key: MESOS-9164
> URL: https://issues.apache.org/jira/browse/MESOS-9164
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: James Peach
>Assignee: Qian Zhang
>Priority: Major
>
> The libprocess subprocess API accepts a set of whitelisted file descriptors 
> that are supposed to be inherited by the child process. On Windows, these 
> are used, but otherwise the subprocess API just ignores them. We should 
> probably make sure that the API clears the {{CLOEXEC}} flag on these 
> descriptors so that they are inherited by the child.
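
What the fix needs to do for each whitelisted descriptor is clear the 
close-on-exec flag before exec; a minimal sketch in plain POSIX (not the 
actual libprocess patch):

{code:cpp}
// Clears FD_CLOEXEC on a descriptor so it survives exec() in the child.
#include <fcntl.h>
#include <unistd.h>

#include <cstdio>

int unsetCloexec(int fd) {
  int flags = fcntl(fd, F_GETFD);
  if (flags == -1) {
    return -1;
  }
  return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}

int main() {
  if (unsetCloexec(STDOUT_FILENO) == -1) {
    perror("fcntl");
    return 1;
  }
  printf("stdout will now be inherited across exec\n");
  return 0;
}
{code}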



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9355) Persistence volume does not unmount correctly with wrong artifact URI

2018-10-25 Thread Ken Liu (JIRA)
Ken Liu created MESOS-9355:
--

 Summary: Persistence volume does not unmount correctly with wrong 
artifact URI
 Key: MESOS-9355
 URL: https://issues.apache.org/jira/browse/MESOS-9355
 Project: Mesos
  Issue Type: Bug
  Components: agent, containerization
Affects Versions: 1.5.1, 1.5.2
 Environment: DCOS 1.11.6

Mesos 1.5.2
Reporter: Ken Liu


The DC/OS service JSON file is like the following. If you specify a wrong URI, 
for example "file://root/test/http.tar.bz2" instead of the correct 
"file:///root/test/http.tar.bz2", all the persistent volume mounts are left 
behind on the agent, and even after the gc_delay timeout the mount path is 
still there.
{code:java}
{
  "id": "/http-server",
  "backoffFactor": 1.15,
  "backoffSeconds": 1,
  "cmd": "python http.py",
  "constraints": [["hostname", "CLUSTER", "172.27.12.216"]],
  "container": {
    "type": "MESOS",
    "volumes": [
      {
        "persistent": {"type": "root", "size": 2048, "constraints": []},
        "mode": "RW",
        "containerPath": "ken-http"
      }
    ]
  },
  "cpus": 0.1,
  "disk": 0,
  "fetch": [
    {
      "uri": "file://root/test/http.tar.bz2",
      "extract": true,
      "executable": false,
      "cache": false
    }
  ],
  "instances": 0,
  "maxLaunchDelaySeconds": 3600,
  "mem": 128,
  "gpus": 0,
  "networks": [{"mode": "host"}],
  "portDefinitions": [],
  "residency": {
    "relaunchEscalationTimeoutSeconds": 3600,
    "taskLostBehavior": "WAIT_FOREVER"
  },
  "requirePorts": false,
  "upgradeStrategy": {"maximumOverCapacity": 0, "minimumHealthCapacity": 0},
  "killSelection": "YOUNGEST_FIRST",
  "unreachableStrategy": "disabled",
  "healthChecks": []
}
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8608) RmdirContinueOnErrorTest.RemoveWithContinueOnError fails.

2018-10-25 Thread Frank Greguska (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663320#comment-16663320
 ] 

Frank Greguska commented on MESOS-8608:
---

So the test itself is trying to create a bind mount: 
[https://github.com/apache/mesos/blob/798cdab3a23b00b8bcf686e08c15c84dbaf3386b/3rdparty/stout/tests/os/rmdir_tests.cpp#L441]

This apparently is not allowed under Docker unless running in privileged mode: 
[https://github.com/moby/moby/issues/5254]

But you can't build in privileged mode: 
[https://github.com/moby/moby/issues/1916]

So the only solution I can come up with at the moment is to not run 
{{make check}} during my build.

> RmdirContinueOnErrorTest.RemoveWithContinueOnError fails.
> -
>
> Key: MESOS-8608
> URL: https://issues.apache.org/jira/browse/MESOS-8608
> Project: Mesos
>  Issue Type: Bug
>  Components: cmake
>Affects Versions: 1.4.1, 1.8.0
> Environment: Docker 17.12.0
> Ubuntu 16.04, Ubuntu 18.04
>Reporter: Pierre-Louis Chevallier
>Priority: Critical
>  Labels: newbie, test
>
> I'm trying to run Mesos in Docker, and when I run {{make check}}, one test 
> fails. I followed all the requirements and instructions in the Mesos getting 
> started guide. The failed test is 
> RmdirContinueOnErrorTest.RemoveWithContinueOnError.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)