[jira] [Assigned] (MESOS-9672) Docker containerizer should ignore pids of executors that do not pass the connection check.

2019-04-24 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-9672:
---

Assignee: Qian Zhang

> Docker containerizer should ignore pids of executors that do not pass the 
> connection check.
> ---
>
> Key: MESOS-9672
> URL: https://issues.apache.org/jira/browse/MESOS-9672
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Meng Zhu
>Assignee: Qian Zhang
>Priority: Critical
>  Labels: containerization
>
> When recovering an executor with a tracked pid, we first try to establish a 
> connection to its libprocess address to avoid reaping an irrelevant process:
> https://github.com/apache/mesos/blob/4580834471fb3bc0b95e2b96e04a63d34faef724/src/slave/containerizer/docker.cpp#L1019-L1054
> If the connection cannot be established, we should not track its pid: 
> https://github.com/apache/mesos/blob/4580834471fb3bc0b95e2b96e04a63d34faef724/src/slave/containerizer/docker.cpp#L1071
> One problem this can cause: if that pid is now used by another executor, the 
> duplicate-pid check errors out and sends the agent into a crash loop:
> https://github.com/apache/mesos/blob/4580834471fb3bc0b95e2b96e04a63d34faef724/src/slave/containerizer/docker.cpp#L1066-L1068
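The recovery logic described above can be sketched as follows. This is an illustrative Python model, not Mesos code; the function name `recover_pids` and the `connected` flag are hypothetical stand-ins for the containerizer's recovery path and its libprocess connection check.

```python
# Illustrative sketch (not Mesos code): during recovery, only track an
# executor's pid if the libprocess connection check succeeded; a pid that
# fails the check may belong to an unrelated process. Names are hypothetical.

def recover_pids(executors):
    """executors: iterable of (executor_id, pid, connected) tuples.

    Returns a pid -> executor_id map of the pids we keep tracking.
    Raises only on a genuine duplicate among *connected* executors."""
    tracked = {}
    for executor_id, pid, connected in executors:
        if not connected:
            # Connection check failed: ignore this pid instead of
            # tracking it (the behavior proposed in this ticket).
            continue
        if pid in tracked:
            raise RuntimeError(f"Duplicate pid {pid}")
        tracked[pid] = executor_id
    return tracked
```

With the old behavior (tracking pids even when the connection check failed), a stale executor's pid reused by a new executor could trip the duplicate check on every agent restart, producing the crash loop described above.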



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9695) Remove the duplicate pid check in Docker containerizer

2019-04-24 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-9695:
---

Shepherd: Gilbert Song
Assignee: Qian Zhang
  Sprint: Containerization: RI13 Sp 45

> Remove the duplicate pid check in Docker containerizer
> --
>
> Key: MESOS-9695
> URL: https://issues.apache.org/jira/browse/MESOS-9695
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>  Labels: containerization
>
> In `DockerContainerizerProcess::_recover`, we check whether two executors use 
> the same pid and error out if we find a duplicate (see 
> [here|https://github.com/apache/mesos/blob/1.7.2/src/slave/containerizer/docker.cpp#L1068:L1078]
>  for details). However, I do not see the value this check gives us, while it 
> can cause a serious issue (an agent crash loop when restarting) in a rare 
> case (a new executor reusing the pid of an old executor), so I think we had 
> better remove it from the Docker containerizer.





[jira] [Assigned] (MESOS-8769) Agent crashes when CNI config not defined

2019-04-24 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-8769:
---

Assignee: Gilbert Song

> Agent crashes when CNI config not defined
> -
>
> Key: MESOS-8769
> URL: https://issues.apache.org/jira/browse/MESOS-8769
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.4.1
>Reporter: Alena Varkockova
>Assignee: Gilbert Song
>Priority: Critical
>  Labels: cni, containerizer
>
> I was deploying an application through Marathon in an integration test that 
> looked like this:
>  * Mesos container (UCR)
>  * container network
>  * some network name specified
> The given network name did not exist; I had not even passed a CNI config to 
> the agent.
> After Mesos tried to deploy my task, the agent crashed because of the 
> missing CNI config.
> {code}
> [31mWARN [0;39m[10:51:53 AppDeployIntegrationTest-MesosAgent-32780] *** 
> SIGABRT (@0x1980) received by PID 6528 (TID 0x7f3124b58700) from PID 6528; 
> stack trace: ***
> [31mWARN [0;39m[10:51:53 AppDeployIntegrationTest-MesosAgent-32780] @ 
> 0x7f312e5c2890 (unknown)
> [31mWARN [0;39m[10:51:53 AppDeployIntegrationTest-MesosAgent-32780] @ 
> 0x7f312e23d067 (unknown)
> [31mWARN [0;39m[10:51:53 AppDeployIntegrationTest-MesosAgent-32780] @ 
> 0x7f312e23e448 (unknown)
> [31mWARN [0;39m[10:51:53 AppDeployIntegrationTest-MesosAgent-32780] @ 
> 0x7f312e236266 (unknown)
> [31mWARN [0;39m[10:51:53 AppDeployIntegrationTest-MesosAgent-32780] @ 
> 0x7f312e236312 (unknown)
> [31mWARN [0;39m[10:51:53 AppDeployIntegrationTest-MesosAgent-32780] @ 
> 0x7f31304fd233 _ZNKR6OptionISsE3getEv.part.103
> [31mWARN [0;39m[10:51:53 AppDeployIntegrationTest-MesosAgent-32780] @ 
> 0x7f313050b60c 
> mesos::internal::slave::NetworkCniIsolatorProcess::getNetworkConfigJSON()
> [31mWARN [0;39m[10:51:53 AppDeployIntegrationTest-MesosAgent-32780] @ 
> 0x7f313050bd54 mesos::internal::slave::NetworkCniIsolatorProcess::prepare()
> [31mWARN [0;39m[10:51:53 AppDeployIntegrationTest-MesosAgent-32780] @ 
> 0x7f313027b903 
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEESt5_BindIFZNS0_8dispatchI6OptionIN5mesos5slave19ContainerLaunchInfoEENS7_8internal5slave20MesosIsolatorProcessERKNS7_11ContainerIDERKNS8_15ContainerConfigESG_SJ_EENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSO_FSM_T1_T2_EOT3_OT4_EUlRSE_RSH_S2_E_SE_SH_St12_PlaceholderILi1E9_M_invokeERKSt9_Any_dataS2_
> [31mWARN [0;39m[10:51:53 AppDeployIntegrationTest-MesosAgent-32780] @ 
> 0x7f3130a7ee29 process::ProcessManager::resume()
> {code}
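The stack trace above shows the crash originating in `Option<std::string>::get()` inside `NetworkCniIsolatorProcess::getNetworkConfigJSON()`, i.e. unconditionally dereferencing an option that is empty when no CNI config was loaded for the requested network. A minimal Python model of the failure mode and the defensive check (the `Option` class and `get_network_config` function here are made up for illustration, not Mesos code):

```python
# Minimal model of the Option::get() crash: calling get() on an empty
# optional aborts the process, so a lookup for an unknown CNI network
# name must be guarded. All names here are illustrative.

class Option:
    _NONE = object()

    def __init__(self, value=_NONE):
        self._value = value

    def is_some(self):
        return self._value is not Option._NONE

    def get(self):
        if not self.is_some():
            # Models the CHECK failure / SIGABRT seen in the stack trace.
            raise AssertionError("Option::get() called on NONE")
        return self._value


def get_network_config(configs, name):
    """Return (config, error) instead of crashing when `name` is unknown."""
    option = Option(configs[name]) if name in configs else Option()
    if not option.is_some():
        return None, f"Unknown CNI network '{name}'"
    return option.get(), None
```

Guarding with `is_some()` (the moral equivalent of `isSome()` on stout's `Option`) turns the agent abort into a task-launch failure, which is the recoverable outcome one would expect here.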





[jira] [Assigned] (MESOS-9355) Persistence volume does not unmount correctly with wrong artifact URI

2019-04-24 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-9355:
---

Assignee: Joseph Wu

> Persistence volume does not unmount correctly with wrong artifact URI
> -
>
> Key: MESOS-9355
> URL: https://issues.apache.org/jira/browse/MESOS-9355
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Affects Versions: 1.5.1, 1.5.2
> Environment: DCOS 1.11.6
> Mesos 1.5.2
>Reporter: Ken Liu
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: persistent-volumes
>
> The DC/OS service JSON file looks like the following. If you type a wrong 
> URI, for example "file://root/test/http.tar.bz2" instead of the correct 
> "file:///root/test/http.tar.bz2", the failed task leaves all of its 
> persistent volume mounts on the agent, and even after the gc_delay timeout 
> the mount paths are still there.
> That means if the task failed 10 times, there are 10 persistent volume 
> mounts left on the agent.
> *Expected result: when a task fails, dangling mount points should be 
> unmounted correctly.*
> {code:java}
> {
>   "id": "/http-server",
>   "backoffFactor": 1.15,
>   "backoffSeconds": 1,
>   "cmd": "python http.py",
>   "constraints": [],
>   "container": {
>     "type": "MESOS",
>     "volumes": [
>       {
>         "persistent": {
>           "type": "root",
>           "size": 2048,
>           "constraints": []
>         },
>         "mode": "RW",
>         "containerPath": "ken-http"
>       }
>     ]
>   },
>   "cpus": 0.1,
>   "disk": 0,
>   "fetch": [
>     {
>       "uri": "file://root/test/http.tar.bz2",
>       "extract": true,
>       "executable": false,
>       "cache": false
>     }
>   ],
>   "instances": 0,
>   "maxLaunchDelaySeconds": 3600,
>   "mem": 128,
>   "gpus": 0,
>   "networks": [
>     {
>       "mode": "host"
>     }
>   ],
>   "portDefinitions": [],
>   "residency": {
>     "relaunchEscalationTimeoutSeconds": 3600,
>     "taskLostBehavior": "WAIT_FOREVER"
>   },
>   "requirePorts": false,
>   "upgradeStrategy": {
>     "maximumOverCapacity": 0,
>     "minimumHealthCapacity": 0
>   },
>   "killSelection": "YOUNGEST_FIRST",
>   "unreachableStrategy": "disabled",
>   "healthChecks": []
> }
> {code}
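The difference between the two URIs is easy to miss: in `file://root/test/http.tar.bz2` the `root` segment is parsed as the URI authority (host), not as part of the path, whereas `file:///root/...` has an empty authority and the intended absolute path. A quick check with Python's standard URI parser makes the distinction visible:

```python
# Why "file://root/..." is wrong: after "file://" the next segment is the
# authority (host), so "root" is not part of the path. "file:///root/..."
# has an empty authority and the intended absolute path.
from urllib.parse import urlparse

wrong = urlparse("file://root/test/http.tar.bz2")
right = urlparse("file:///root/test/http.tar.bz2")

# wrong.netloc == "root", wrong.path == "/test/http.tar.bz2"
# right.netloc == "",     right.path == "/root/test/http.tar.bz2"
```

So the fetcher never sees `/root/test/http.tar.bz2` in the wrong form; the fetch fails, and per this ticket the failure path leaks the persistent-volume mounts.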





[jira] [Assigned] (MESOS-9306) Mesos containerizer can get stuck during cgroup cleanup

2019-04-24 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-9306:
---

Assignee: Andrei Budnik

> Mesos containerizer can get stuck during cgroup cleanup
> ---
>
> Key: MESOS-9306
> URL: https://issues.apache.org/jira/browse/MESOS-9306
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Affects Versions: 1.7.0
>Reporter: Greg Mann
>Assignee: Andrei Budnik
>Priority: Critical
>  Labels: containerizer, mesosphere
>
> I observed a task group's executor container which failed to be completely 
> destroyed after its associated tasks were killed. The following is an excerpt 
> from the agent log which is filtered to include only lines with the container 
> ID, {{d463b9fe-970d-4077-bab9-558464889a9e}}:
> {code}
> 2018-10-10 14:20:50: I1010 14:20:50.204756  6799 containerizer.cpp:2963] 
> Container d463b9fe-970d-4077-bab9-558464889a9e has exited
> 2018-10-10 14:20:50: I1010 14:20:50.204839  6799 containerizer.cpp:2457] 
> Destroying container d463b9fe-970d-4077-bab9-558464889a9e in RUNNING state
> 2018-10-10 14:20:50: I1010 14:20:50.204859  6799 containerizer.cpp:3124] 
> Transitioning the state of container d463b9fe-970d-4077-bab9-558464889a9e 
> from RUNNING to DESTROYING
> 2018-10-10 14:20:50: I1010 14:20:50.204960  6799 linux_launcher.cpp:580] 
> Asked to destroy container d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.204993  6799 linux_launcher.cpp:622] 
> Destroying cgroup 
> '/sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:20:50: I1010 14:20:50.205417  6806 cgroups.cpp:2838] Freezing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.205477  6810 cgroups.cpp:2838] Freezing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.205708  6808 cgroups.cpp:1229] 
> Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after 
> 203008ns
> 2018-10-10 14:20:50: I1010 14:20:50.205878  6800 cgroups.cpp:1229] 
> Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after 
> 339200ns
> 2018-10-10 14:20:50: I1010 14:20:50.206185  6799 cgroups.cpp:2856] Thawing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.206226  6808 cgroups.cpp:2856] Thawing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.206455  6808 cgroups.cpp:1258] 
> Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after 
> 83968ns
> 2018-10-10 14:20:50: I1010 14:20:50.306803  6810 cgroups.cpp:1258] 
> Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after 
> 100.50816ms
> 2018-10-10 14:20:50: I1010 14:20:50.307531  6805 linux_launcher.cpp:654] 
> Destroying cgroup 
> '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:21:40: W1010 14:21:40.032855  6809 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:22:40: W1010 14:22:40.031224  6800 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:23:40: W1010 14:23:40.031946  6799 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:24:40: W1010 14:24:40.032979  6804 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:25:40: W1010 14:25:40.030784  6808 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:26:40: W1010 14:26:40.032526  6810 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:27:40: W1010 14:27:40.029932  6801 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> {code}
> The last log line from the containerizer's destroy path is:
> {code}
> 14:20:50.307531  6805 linux_launcher.cpp:654] Destroying cgroup 
> '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> {code}
> (that is the second such log line, from {{LinuxLauncherProcess::_destroy}})
> Then we just see the repeated warning
> {code}
> containerizer.cpp:2401] Skipping status for container 
> d463b9fe-970d-4077-bab9-558464889a9e because: Container does not exist
> {code}
> and the cgroup destroy never completes.

[jira] [Commented] (MESOS-9742) If a HTTP endpoint goes away before finishing sending of data HTTP requests hang

2019-04-24 Thread Benjamin Bannier (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825471#comment-16825471
 ] 

Benjamin Bannier commented on MESOS-9742:
-

Linking the related MESOS-6778, which is about having callers of {{http::request}} 
close the connection.

> If a HTTP endpoint goes away before finishing sending of data HTTP requests 
> hang
> 
>
> Key: MESOS-9742
> URL: https://issues.apache.org/jira/browse/MESOS-9742
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Benjamin Bannier
>Priority: Major
>
> If an HTTP request is made to a remote that goes away before it finishes 
> sending its data, the request hangs forever.
> {code}
> TEST_P(HTTPTest, NOPE)
> {
>   Try<inet::Socket> create = inet::Socket::create();
>   ASSERT_SOME(create);
>   Future<http::Response> response;
>   {
> // Post a request which never gets a response.
> Http http;
> EXPECT_CALL(*http.process, body(_));
> response = http::post(http.process->self(), "body/");
> // Wait for some time so the request was posted. There's probably
> // some internal state we could wait for.
> ASSERT_SOME(os::sleep(Milliseconds(300)));
>   }
>   AWAIT_FAILED(response); // Hangs.
> }
> {code}
> While this has likely been an issue for some time it came up with the 
> introduction of agent components which communicate with the agent over HTTP 
> connections, e.g., for the container daemon or the storage local resource 
> providers. Here it becomes hard to reason about the life cycle of async call 
> chains, and also introduces some issues when e.g., executing tests in 
> repetition where we effectively leak sockets (by having {{Future}} holding on 
> to the sockets but never reaching a terminal state), see MESOS-8428.
> We should evaluate whether we can turn a closed socket into e.g., a failed 
> {{Future}}.





[jira] [Created] (MESOS-9742) If a HTTP endpoint goes away before finishing sending of data HTTP requests hang

2019-04-24 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-9742:
---

 Summary: If a HTTP endpoint goes away before finishing sending of 
data HTTP requests hang
 Key: MESOS-9742
 URL: https://issues.apache.org/jira/browse/MESOS-9742
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Reporter: Benjamin Bannier


If an HTTP request is made to a remote that goes away before it finishes sending 
its data, the request hangs forever.
{code}
TEST_P(HTTPTest, NOPE)
{
  Try<inet::Socket> create = inet::Socket::create();
  ASSERT_SOME(create);

  Future<http::Response> response;

  {
// Post a request which never gets a response.
Http http;
EXPECT_CALL(*http.process, body(_));
response = http::post(http.process->self(), "body/");

// Wait for some time so the request was posted. There's probably
// some internal state we could wait for.
ASSERT_SOME(os::sleep(Milliseconds(300)));
  }

  AWAIT_FAILED(response); // Hangs.
}
{code}

While this has likely been an issue for some time it came up with the 
introduction of agent components which communicate with the agent over HTTP 
connections, e.g., for the container daemon or the storage local resource 
providers. Here it becomes hard to reason about the life cycle of async call 
chains, and also introduces some issues when e.g., executing tests in 
repetition where we effectively leak sockets (by having {{Future}} holding on 
to the sockets but never reaching a terminal state), see MESOS-8428.

We should evaluate whether we can turn a closed socket into e.g., a failed 
{{Future}}.





[jira] [Comment Edited] (MESOS-8522) `prepareMounts` in Mesos containerizer is flaky.

2019-04-24 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825419#comment-16825419
 ] 

Gilbert Song edited comment on MESOS-8522 at 4/24/19 6:37 PM:
--

Probably we could simply check os::exists(mount.target) for this case, 
assuming the mount point is cleaned up when the target is unmounted?


was (Author: gilbert):
probably we could just simply check os::exists(mount.target) for this case?

> `prepareMounts` in Mesos containerizer is flaky.
> 
>
> Key: MESOS-8522
> URL: https://issues.apache.org/jira/browse/MESOS-8522
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.5.0
>Reporter: Chun-Hung Hsiao
>Assignee: Jie Yu
>Priority: Major
>  Labels: mesosphere, storage
>
> The 
> [{{prepareMount()}}|https://github.com/apache/mesos/blob/1.5.x/src/slave/containerizer/mesos/launch.cpp#L244]
>  function in {{src/slave/containerizer/mesos/launch.cpp}} sometimes fails 
> with the following error:
> {noformat}
> Failed to prepare mounts: Failed to mark 
> '/home/docker/containers/af78db6ebc1aff572e576b773d1378121a66bb755ed63b3278e759907e5fe7b6/shm'
>  as slave: Invalid argument
> {noformat}
> The error message comes from 
> https://github.com/apache/mesos/blob/1.5.x/src/slave/containerizer/mesos/launch.cpp#L326.
> Although it does not happen frequently, it can be reproduced by running tests 
> that need to clone mount namespaces in repetition. For example, I just 
> reproduced the bug with the following command after 17 minutes:
> {noformat}
> sudo bin/mesos-tests.sh --gtest_filter='*ROOT_PublishResourcesRecovery' 
> --gtest_break_on_failure --gtest_repeat=-1 --verbose
> {noformat}
> Note that in this example, the test itself does not involve any Docker image 
> or the Docker containerizer.
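Gilbert's suggestion in the comments, skipping the mark when the mount target no longer exists, can be sketched like this. This is an illustrative Python sketch, not Mesos code; `mark_slave` is a hypothetical stand-in for the real `mount(2)` call with `MS_SLAVE`, and `prepare_mount` for the launcher's per-mount step.

```python
# Sketch of the proposed guard: before re-marking a mount point as a slave
# mount, check that the target still exists; a Docker container that exited
# concurrently may have removed it, which makes mount(2) fail with EINVAL.
# `mark_slave` is a hypothetical stand-in for the real mount call.
import os

def prepare_mount(target, mark_slave):
    if not os.path.exists(target):
        # Target vanished (e.g. a flapping Docker container): skip it
        # instead of failing the whole launch with "Invalid argument".
        return "skipped"
    mark_slave(target)
    return "marked"
```

The assumption to verify (as the comment notes) is that the kernel has indeed torn the mount down whenever the target path is gone, so skipping is always safe.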





[jira] [Commented] (MESOS-8522) `prepareMounts` in Mesos containerizer is flaky.

2019-04-24 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825419#comment-16825419
 ] 

Gilbert Song commented on MESOS-8522:
-

Probably we could simply check os::exists(mount.target) for this case?

> `prepareMounts` in Mesos containerizer is flaky.
> 
>
> Key: MESOS-8522
> URL: https://issues.apache.org/jira/browse/MESOS-8522
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.5.0
>Reporter: Chun-Hung Hsiao
>Assignee: Jie Yu
>Priority: Major
>  Labels: mesosphere, storage
>
> The 
> [{{prepareMount()}}|https://github.com/apache/mesos/blob/1.5.x/src/slave/containerizer/mesos/launch.cpp#L244]
>  function in {{src/slave/containerizer/mesos/launch.cpp}} sometimes fails 
> with the following error:
> {noformat}
> Failed to prepare mounts: Failed to mark 
> '/home/docker/containers/af78db6ebc1aff572e576b773d1378121a66bb755ed63b3278e759907e5fe7b6/shm'
>  as slave: Invalid argument
> {noformat}
> The error message comes from 
> https://github.com/apache/mesos/blob/1.5.x/src/slave/containerizer/mesos/launch.cpp#L326.
> Although it does not happen frequently, it can be reproduced by running tests 
> that need to clone mount namespaces in repetition. For example, I just 
> reproduced the bug with the following command after 17 minutes:
> {noformat}
> sudo bin/mesos-tests.sh --gtest_filter='*ROOT_PublishResourcesRecovery' 
> --gtest_break_on_failure --gtest_repeat=-1 --verbose
> {noformat}
> Note that in this example, the test itself does not involve any Docker image 
> or the Docker containerizer.





[jira] [Commented] (MESOS-8522) `prepareMounts` in Mesos containerizer is flaky.

2019-04-24 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825418#comment-16825418
 ] 

Gilbert Song commented on MESOS-8522:
-

[~chhsia0] [~bbannier] What is the priority of this issue? Does it only happen 
when there is a race with flapping Docker containers?

> `prepareMounts` in Mesos containerizer is flaky.
> 
>
> Key: MESOS-8522
> URL: https://issues.apache.org/jira/browse/MESOS-8522
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.5.0
>Reporter: Chun-Hung Hsiao
>Assignee: Jie Yu
>Priority: Major
>  Labels: mesosphere, storage
>
> The 
> [{{prepareMount()}}|https://github.com/apache/mesos/blob/1.5.x/src/slave/containerizer/mesos/launch.cpp#L244]
>  function in {{src/slave/containerizer/mesos/launch.cpp}} sometimes fails 
> with the following error:
> {noformat}
> Failed to prepare mounts: Failed to mark 
> '/home/docker/containers/af78db6ebc1aff572e576b773d1378121a66bb755ed63b3278e759907e5fe7b6/shm'
>  as slave: Invalid argument
> {noformat}
> The error message comes from 
> https://github.com/apache/mesos/blob/1.5.x/src/slave/containerizer/mesos/launch.cpp#L326.
> Although it does not happen frequently, it can be reproduced by running tests 
> that need to clone mount namespaces in repetition. For example, I just 
> reproduced the bug with the following command after 17 minutes:
> {noformat}
> sudo bin/mesos-tests.sh --gtest_filter='*ROOT_PublishResourcesRecovery' 
> --gtest_break_on_failure --gtest_repeat=-1 --verbose
> {noformat}
> Note that in this example, the test itself does not involve any Docker image 
> or the Docker containerizer.





[jira] [Assigned] (MESOS-8511) Provide a v0/v1 test scheduler to simplify the tests.

2019-04-24 Thread Benjamin Mahler (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-8511:
--

Assignee: Benjamin Mahler

> Provide a v0/v1 test scheduler to simplify the tests.
> -
>
> Key: MESOS-8511
> URL: https://issues.apache.org/jira/browse/MESOS-8511
> Project: Mesos
>  Issue Type: Improvement
>  Components: test
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: tech-debt
>
> Currently, there are a lot of tests that just want to launch a task in order 
> to test some behavior of the system. These tests have to create their own v0 
> or v1 scheduler and invoke the necessary calls on it and expect the necessary 
> calls / messages back. This is rather verbose.
> It would be helpful to have some better abstractions here, like a 
> TestScheduler that can launch tasks and exposes the status updates for them, 
> along with other interesting information. E.g.
> {code}
> class TestScheduler
> {
>   // Add the task to the queue of tasks that need to be launched.
>   // Returns the stream of status updates for this task.
>   Queue<TaskStatus> addTask(const TaskInfo& t);
>   // etc.
> };
> {code}
> Probably this could be implemented against both v0 and v1, if we want to 
> parameterize the tests.





[jira] [Assigned] (MESOS-9701) Allocator's roles map should track reservations.

2019-04-24 Thread Benjamin Mahler (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-9701:
--

Assignee: Andrei Sekretenko

> Allocator's roles map should track reservations.
> 
>
> Key: MESOS-9701
> URL: https://issues.apache.org/jira/browse/MESOS-9701
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Andrei Sekretenko
>Priority: Major
>  Labels: resource-management
>
> Currently, the allocator's {{roles}} map only tracks roles that have 
> allocations or framework subscriptions:
> https://github.com/apache/mesos/blob/1.7.2/src/master/allocator/mesos/hierarchical.hpp#L531-L535
> And we separately track a map of total reservations for each role:
> https://github.com/apache/mesos/blob/1.7.2/src/master/allocator/mesos/hierarchical.hpp#L541-L547
> Confusingly, the {{roles}} map won't have an entry when there is a 
> reservation for a role but no allocations or frameworks subscribed. We should 
> ensure that the map has an entry when there are reservations. Also, we can 
> consolidate the reservation information and framework ids into the same map, 
> e.g.:
> {code}
> struct Role
> {
>   hashset<FrameworkID> frameworkIds;
>   ResourceQuantities totalReservations;
> };
> hashmap<std::string, Role> roles;
> {code}





[jira] [Commented] (MESOS-8522) `prepareMounts` in Mesos containerizer is flaky.

2019-04-24 Thread Benjamin Bannier (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825335#comment-16825335
 ] 

Benjamin Bannier commented on MESOS-8522:
-

[~jieyu], are you working on this? If not, let's talk with e.g., [~gilbert] to 
get this onto somebody else's plate.

> `prepareMounts` in Mesos containerizer is flaky.
> 
>
> Key: MESOS-8522
> URL: https://issues.apache.org/jira/browse/MESOS-8522
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.5.0
>Reporter: Chun-Hung Hsiao
>Assignee: Jie Yu
>Priority: Major
>  Labels: mesosphere, storage
>
> The 
> [{{prepareMount()}}|https://github.com/apache/mesos/blob/1.5.x/src/slave/containerizer/mesos/launch.cpp#L244]
>  function in {{src/slave/containerizer/mesos/launch.cpp}} sometimes fails 
> with the following error:
> {noformat}
> Failed to prepare mounts: Failed to mark 
> '/home/docker/containers/af78db6ebc1aff572e576b773d1378121a66bb755ed63b3278e759907e5fe7b6/shm'
>  as slave: Invalid argument
> {noformat}
> The error message comes from 
> https://github.com/apache/mesos/blob/1.5.x/src/slave/containerizer/mesos/launch.cpp#L326.
> Although it does not happen frequently, it can be reproduced by running tests 
> that need to clone mount namespaces in repetition. For example, I just 
> reproduced the bug with the following command after 17 minutes:
> {noformat}
> sudo bin/mesos-tests.sh --gtest_filter='*ROOT_PublishResourcesRecovery' 
> --gtest_break_on_failure --gtest_repeat=-1 --verbose
> {noformat}
> Note that in this example, the test itself does not involve any Docker image 
> or the Docker containerizer.





[jira] [Assigned] (MESOS-7309) Support specifying devices for a container.

2019-04-24 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-7309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-7309:
--

Assignee: Chun-Hung Hsiao

> Support specifying devices for a container.
> ---
>
> Key: MESOS-7309
> URL: https://issues.apache.org/jira/browse/MESOS-7309
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Jie Yu
>Assignee: Chun-Hung Hsiao
>Priority: Major
>  Labels: mesosphere, storage
>
> Some containers require certain devices to be available inside the container 
> (e.g., /dev/fuse). Currently, the default devices are hard-coded if a rootfs 
> image is specified for the container.
> We should allow frameworks to specify additional devices to be made 
> available to the container. Besides bind-mounting the device file, the 
> devices cgroup needs to be configured properly to allow access to the device.





[jira] [Assigned] (MESOS-7309) Support specifying devices for a container.

2019-04-24 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-7309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-7309:
--

Assignee: (was: Chun-Hung Hsiao)

> Support specifying devices for a container.
> ---
>
> Key: MESOS-7309
> URL: https://issues.apache.org/jira/browse/MESOS-7309
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Jie Yu
>Priority: Major
>  Labels: mesosphere, storage
>
> Some containers require certain devices to be available inside the container 
> (e.g., /dev/fuse). Currently, the default devices are hard-coded if a rootfs 
> image is specified for the container.
> We should allow frameworks to specify additional devices to be made 
> available to the container. Besides bind-mounting the device file, the 
> devices cgroup needs to be configured properly to allow access to the device.





[jira] [Commented] (MESOS-8384) Add health check for local resource providers.

2019-04-24 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825274#comment-16825274
 ] 

Chun-Hung Hsiao commented on MESOS-8384:


Since the SLRP is now an actor in the agent, a better way may be to health-check 
the CSI plugins instead.

> Add health check for local resource providers.
> --
>
> Key: MESOS-8384
> URL: https://issues.apache.org/jira/browse/MESOS-8384
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Priority: Major
>  Labels: mesosphere-dss-post-ga
>
> Similar to what we do for the agent, the resource provider manager needs to 
> health-check resource providers and mark them as unreachable if the health 
> check times out.





[jira] [Commented] (MESOS-9740) Invalid protobuf unions in ExecutorInfo::ContainerInfo will prevent agents from reregistering with 1.8+ masters

2019-04-24 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825261#comment-16825261
 ] 

Joseph Wu commented on MESOS-9740:
--

Yes.  We expect the upgrade to work for most people.  However, our test cluster 
had a relatively wide variety of tasks; and just a single bad framework, 
launching 1+ task on each agent, could cripple the upgrade.

I should clarify that this affects 1.8.x **masters**.  A 1.7.x agent _might_ 
have trouble registering with a 1.8.x master due to this bug.

> Invalid protobuf unions in ExecutorInfo::ContainerInfo will prevent agents 
> from reregistering with 1.8+ masters
> ---
>
> Key: MESOS-9740
> URL: https://issues.apache.org/jira/browse/MESOS-9740
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.8.0
>Reporter: Joseph Wu
>Assignee: Benno Evers
>Priority: Blocker
>  Labels: foundations, mesosphere
>
> As part of MESOS-6874, the master now validates protobuf unions passed as 
> part of an {{ExecutorInfo::ContainerInfo}}.  This prevents a task from 
> specifying, for example, a {{ContainerInfo::MESOS}}, but filling out the 
> {{docker}} field (which is then ignored by the agent).
> However, if a task was already launched with an invalid protobuf union, the 
> same validation will happen when the agent tries to reregister with the 
> master.  In this case, if the master is upgraded to validate protobuf unions, 
> the agent reregistration will be rejected.
> {code}
> master.cpp:7201] Dropping re-registration of agent at 
> slave(1)@172.31.47.126:5051 because it sent an invalid re-registration: 
> Protobuf union `mesos.ContainerInfo` with `Type == MESOS` should not have the 
> field `docker` set.
> {code}
> This bug was found when upgrading a 1.7.x test cluster to 1.8.0.  When 
> MESOS-6874 was committed, I had assumed the invalid protobufs would be rare.  
> However, on the test cluster, 13/17 agents had at least one invalid 
> ContainerInfo when reregistering.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9718) Compile failures with char8_t by MSVC under /std:c++latest(C++20) mode

2019-04-24 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825251#comment-16825251
 ] 

Andrei Budnik commented on MESOS-9718:
--

Hi [~QuellaZhang],

Just verified your patch in our internal CI - LGTM!

BTW, could these tests be compiled if you remove only the u8 prefix from the 
string literals? E.g., use
"~~~\u00ff\u00ff\u00ff\u00ff"
instead of
u8"~~~\u00ff\u00ff\u00ff\u00ff" (or "~~~\xC3\xBF\xC3\xBF\xC3\xBF\xC3\xBF")


Would you like to send a PR for the patch on [https://github.com/apache/mesos]?
[http://mesos.apache.org/documentation/latest/beginner-contribution/#open-a-pr]

> Compile failures with char8_t by MSVC under /std:c++latest(C++20) mode
> --
>
> Key: MESOS-9718
> URL: https://issues.apache.org/jira/browse/MESOS-9718
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: QuellaZhang
>Priority: Major
>  Labels: windows
> Attachments: mesos.patch.txt
>
>
> Hi All,
> We've stumbled across some build failures in Mesos after implementing support 
> for char8_t under /std:c++latest in the development version of Visual C++. 
> Could you help look at this? Thanks in advance! Note that this issue is only 
> found when compiling with an unreleased VC toolset; the next release of MSVC 
> will have this behavior.
> *Repro steps:*
>  git clone -c core.autocrlf=true [https://github.com/apache/mesos] 
> D:\mesos\src
>  open a VS 2017 x64 command prompt as admin and browse to D:\mesos
>  set _CL_=/std:c++latest
>  cd src
>  .\bootstrap.bat
>  cd ..
>  mkdir build_x64 && pushd build_x64
>  cmake ..\src -G "Visual Studio 15 2017 Win64" 
> -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64
> *Failures:*
>  base64_tests.i
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2664: 
> 'std::string base64::encode_url_safe(const std::string &,bool)': cannot 
> convert argument 1 from 'const char8_t [12]' to 'const std::string &'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: Reason: cannot 
> convert from 'const char8_t [12]' to 'const std::string'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: No constructor 
> could take the source type, or constructor overload resolution was ambiguous
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2660: 
> 'testing::internal::EqHelper::Compare': function does not take 3 
> arguments
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430):
>  note: see declaration of 'testing::internal::EqHelper::Compare'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2512: 
> 'testing::AssertionResult': no appropriate default constructor available
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256):
>  note: see declaration of 'testing::AssertionResult'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2664: 
> 'std::string base64::encode_url_safe(const std::string &,bool)': cannot 
> convert argument 1 from 'const char8_t [12]' to 'const std::string &'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: Reason: cannot 
> convert from 'const char8_t [12]' to 'const std::string'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: No constructor 
> could take the source type, or constructor overload resolution was ambiguous
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2660: 
> 'testing::internal::EqHelper::Compare': function does not take 3 
> arguments
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430):
>  note: see declaration of 'testing::internal::EqHelper::Compare'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2512: 
> 'testing::AssertionResult': no appropriate default constructor available
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256):
>  note: see declaration of 'testing::AssertionResult'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2664: 
> 'Try base64::decode_url_safe(const std::string &)': cannot 
> convert argument 1 from 'const char8_t [16]' to 'const std::string &'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): note: Reason: cannot 
> convert from 'const char8_t [16]' to 'const std::string'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): note: No constructor 
> could take the source type, or constructor overload resolution was ambiguous
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2672: 
> 'AssertSomeEq': no matching overloaded function found
>