[jira] [Commented] (MESOS-9969) Agent crashes when trying to clean up volume

2019-09-18 Thread Jan Schlicht (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932173#comment-16932173
 ] 

Jan Schlicht commented on MESOS-9969:
-

This looks like MESOS-9966.

> Agent crashes when trying to clean up volume
> ---
>
> Key: MESOS-9969
> URL: https://issues.apache.org/jira/browse/MESOS-9969
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.8.2
>Reporter: Tomas Barton
>Priority: Major
>
> {code}
> Sep 17 13:49:26 w03 mesos-agent[21803]: I0917 13:49:26.081748 21828 
> linux_launcher.cpp:650] Destroying cgroup 
> '/sys/fs/cgroup/systemd/mesos/370ed262-4041-4180-a7e1-9ea78070e3a6'
> Sep 17 13:49:26 w03 mesos-agent[21803]: I0917 13:49:26.081876 21832 
> containerizer.cpp:2907] Checkpointing termination state to nested container's 
> runtime directory 
> '/var/run/mesos/containers/8e3997e7-c53a-4043-9a7e-26a2e436a041/containers/ae0bdc6d-c738-4352-b5d4-7572182671d5/termination'
> Sep 17 13:49:26 w03 mesos-agent[21803]: mesos-agent: 
> /pkg/src/mesos/3rdparty/stout/include/stout/option.hpp:120: T& 
> Option::get() & [with T = std::basic_string]: Assertion `isSome()' 
> failed.
> Sep 17 13:49:26 w03 mesos-agent[21803]: *** Aborted at 1568728166 (unix time) 
> try "date -d @1568728166" if you are using GNU date ***
> Sep 17 13:49:26 w03 mesos-agent[21803]: W0917 13:49:26.082281 21835 
> disk.cpp:453] Ignoring cleanup for unknown container 
> a9ba6959-ea02-4543-b7d5-92a63940
> Sep 17 13:49:26 w03 mesos-agent[21803]: PC: @ 0x7f16a3867fff (unknown)
> Sep 17 13:49:26 w03 mesos-agent[21803]: *** SIGABRT (@0x552b) received by PID 
> 21803 (TID 0x7f169e47d700) from PID 21803; stack trace: ***
> Sep 17 13:49:26 w03 mesos-agent[21803]: E0917 13:49:26.082608 21835 
> memory.cpp:501] Listening on OOM events failed for container 
> a9ba6959-ea02-4543-b7d5-92a63940: Event listener is terminating
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3be50e0 (unknown)
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3867fff (unknown)
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a386942a (unknown)
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3860e67 (unknown)
> Sep 17 13:49:26 w03 mesos-agent[21803]: I0917 13:49:26.083741 21835 
> linux.cpp:1074] Unmounting volume 
> '/var/lib/mesos/slave/slaves/04e596b7-f03d-4cba-bbbc-fa9e0aebb5d2-S17/frameworks/04e596b7-f03d-4cba-bbbc-fa9e0aebb5d2-0003/executors/es01__coordinator__8591ac8e-3d9d-45ac-bb68-bee379c8c4a4/runs/a9ba6959-ea02-4543-b7d5-92a63940/container-path'
>  for con
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3860f12 (unknown)
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a7654f13 
> _ZNR6OptionISsE3getEv.part.152
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a7666b2f 
> mesos::internal::slave::MesosContainerizerProcess::__destroy()
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a861cb41 
> process::ProcessBase::consume()
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a8633c9c 
> process::ProcessManager::resume()
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a86398a6 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a43c6200 (unknown)
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3bdb4a4 start_thread
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a391dd0f (unknown)
> Sep 17 13:49:26 w03 systemd[1]: dcos-mesos-slave.service: Main process 
> exited, code=killed, status=6/ABRT
> {code}
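
The aborting frame {{_ZNR6OptionISsE3getEv}} in the trace demangles to {{Option<std::string>::get() &}}, i.e. stout's {{Option}} asserting {{isSome()}} before handing out its value ({{3rdparty/stout/include/stout/option.hpp}} in the assertion message). A minimal, self-contained stand-in (not the actual stout header) that reproduces the same failure mode when a lookup result is used without checking it first:
{code}
#include <cassert>
#include <iostream>
#include <string>
#include <utility>

// Cut-down stand-in for stout's Option<T>, just enough to show the abort.
template <typename T>
class Option
{
public:
  static Option<T> none() { return Option<T>(); }
  static Option<T> some(T t) { return Option<T>(std::move(t)); }

  bool isSome() const { return set; }

  // Like stout, get() asserts that a value is present; calling it on a
  // "none" value aborts the process with SIGABRT.
  T& get() & { assert(isSome()); return value; }

private:
  Option() : set(false), value() {}
  explicit Option(T t) : set(true), value(std::move(t)) {}

  bool set;
  T value;
};

int main()
{
  // Stand-in for a sandbox path lookup that found nothing.
  Option<std::string> sandboxPath = Option<std::string>::none();

  if (sandboxPath.isSome()) {
    std::cout << sandboxPath.get() << std::endl;  // guarded access is fine
  }

  // Unguarded access aborts: Assertion `isSome()' failed -- the same
  // failure mode as in the agent log above.
  sandboxPath.get();

  return 0;
}
{code}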



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9966) Agent crashes when trying to destroy orphaned nested container if root container is orphaned as well

2019-09-18 Thread Jan Schlicht (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932152#comment-16932152
 ] 

Jan Schlicht commented on MESOS-9966:
-

You're right, the flag is enabled.

> Agent crashes when trying to destroy orphaned nested container if root 
> container is orphaned as well
> 
>
> Key: MESOS-9966
> URL: https://issues.apache.org/jira/browse/MESOS-9966
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.7.3
>Reporter: Jan Schlicht
>Assignee: Qian Zhang
>Priority: Major
>
> Noticed an agent crash-looping when trying to recover. It recognized a 
> container and its nested container as orphaned. When trying to destroy the 
> nested container, the agent crashes. Probably when trying to [get the sandbox 
> path of the root 
> container|https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2966].
> {noformat}
> 2019-09-09 05:04:26: I0909 05:04:26.382326 89950 linux_launcher.cpp:286] 
> Recovering Linux launcher
> 2019-09-09 05:04:26: I0909 05:04:26.383162 89950 linux_launcher.cpp:331] Not 
> recovering cgroup mesos/a127917b-96fe-4100-b73d-5f876ce9ffc1/mesos
> 2019-09-09 05:04:26: I0909 05:04:26.383199 89950 linux_launcher.cpp:343] 
> Recovered container 
> a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97
> 2019-09-09 05:04:26: I0909 05:04:26.383216 89950 linux_launcher.cpp:331] Not 
> recovering cgroup 
> mesos/a127917b-96fe-4100-b73d-5f876ce9ffc1/mesos/9783e2bb-7c2e-4930-9d39-4225bb6f1b97/mesos
> 2019-09-09 05:04:26: I0909 05:04:26.383229 89950 linux_launcher.cpp:343] 
> Recovered container 2ee154e2-3cc4-420a-99fb-065e740f3091
> 2019-09-09 05:04:26: I0909 05:04:26.383237 89950 linux_launcher.cpp:343] 
> Recovered container a127917b-96fe-4100-b73d-5f876ce9ffc1
> 2019-09-09 05:04:26: I0909 05:04:26.383249 89950 linux_launcher.cpp:343] 
> Recovered container 
> 2ee154e2-3cc4-420a-99fb-065e740f3091.49fe2bf9-17af-415f-92b6-92a4db619436
> 2019-09-09 05:04:26: I0909 05:04:26.383260 89950 linux_launcher.cpp:331] Not 
> recovering cgroup mesos/2ee154e2-3cc4-420a-99fb-065e740f3091/mesos
> 2019-09-09 05:04:26: I0909 05:04:26.383271 89950 linux_launcher.cpp:331] Not 
> recovering cgroup 
> mesos/2ee154e2-3cc4-420a-99fb-065e740f3091/mesos/49fe2bf9-17af-415f-92b6-92a4db619436/mesos
> 2019-09-09 05:04:26: I0909 05:04:26.383280 89950 linux_launcher.cpp:437] 
> 2ee154e2-3cc4-420a-99fb-065e740f3091.49fe2bf9-17af-415f-92b6-92a4db619436 is 
> a known orphaned container
> 2019-09-09 05:04:26: I0909 05:04:26.383289 89950 linux_launcher.cpp:437] 
> a127917b-96fe-4100-b73d-5f876ce9ffc1 is a known orphaned container
> 2019-09-09 05:04:26: I0909 05:04:26.383296 89950 linux_launcher.cpp:437] 
> 2ee154e2-3cc4-420a-99fb-065e740f3091 is a known orphaned container
> 2019-09-09 05:04:26: I0909 05:04:26.383304 89950 linux_launcher.cpp:437] 
> a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97 is 
> a known orphaned container
> 2019-09-09 05:04:26: I0909 05:04:26.383414 89950 containerizer.cpp:1092] 
> Recovering isolators
> 2019-09-09 05:04:26: I0909 05:04:26.385931 89977 memory.cpp:478] Started 
> listening for OOM events for container a127917b-96fe-4100-b73d-5f876ce9ffc1
> 2019-09-09 05:04:26: I0909 05:04:26.386118 89977 memory.cpp:590] Started 
> listening on 'low' memory pressure events for container 
> a127917b-96fe-4100-b73d-5f876ce9ffc1
> 2019-09-09 05:04:26: I0909 05:04:26.386152 89977 memory.cpp:590] Started 
> listening on 'medium' memory pressure events for container 
> a127917b-96fe-4100-b73d-5f876ce9ffc1
> 2019-09-09 05:04:26: I0909 05:04:26.386175 89977 memory.cpp:590] Started 
> listening on 'critical' memory pressure events for container 
> a127917b-96fe-4100-b73d-5f876ce9ffc1
> 2019-09-09 05:04:26: I0909 05:04:26.386227 89977 memory.cpp:478] Started 
> listening for OOM events for container 2ee154e2-3cc4-420a-99fb-065e740f3091
> 2019-09-09 05:04:26: I0909 05:04:26.386248 89977 memory.cpp:590] Started 
> listening on 'low' memory pressure events for container 
> 2ee154e2-3cc4-420a-99fb-065e740f3091
> 2019-09-09 05:04:26: I0909 05:04:26.386270 89977 memory.cpp:590] Started 
> listening on 'medium' memory pressure events for container 
> 2ee154e2-3cc4-420a-99fb-065e740f3091
> 2019-09-09 05:04:26: I0909 05:04:26.386376 89977 memory.cpp:590] Started 
> listening on 'critical' memory pressure events for container 
> 2ee154e2-3cc4-420a-99fb-065e740f3091
> 2019-09-09 05:04:26: I0909 05:04:26.386694 89921 containerizer.cpp:1131] 
> Recovering provisioner
> 2019-09-09 05:04:26: I0909 05:04:26.388226 90010 metadata_manager.cpp:286] 
> Successfully loaded 64 Docker images
> 2019-09-09 05:04:26: I0909 05:04:26.388420 

[jira] [Commented] (MESOS-9966) Agent crashes when trying to destroy orphaned nested container if root container is orphaned as well

2019-09-18 Thread Jan Schlicht (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932150#comment-16932150
 ] 

Jan Schlicht commented on MESOS-9966:
-

According to the stack trace, we are hitting that code path. Let me double-check 
whether {{gc_non_executor_container_sandboxes}} is enabled.

> Agent crashes when trying to destroy orphaned nested container if root 
> container is orphaned as well
> 
>
> Key: MESOS-9966
> URL: https://issues.apache.org/jira/browse/MESOS-9966
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.7.3
>Reporter: Jan Schlicht
>Assignee: Qian Zhang
>Priority: Major
>
> Noticed an agent crash-looping when trying to recover. It recognized a 
> container and its nested container as orphaned. When trying to destroy the 
> nested container, the agent crashes. Probably when trying to [get the sandbox 
> path of the root 
> container|https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2966].
> {noformat}
> 2019-09-09 05:04:26: I0909 05:04:26.382326 89950 linux_launcher.cpp:286] 
> Recovering Linux launcher
> 2019-09-09 05:04:26: I0909 05:04:26.383162 89950 linux_launcher.cpp:331] Not 
> recovering cgroup mesos/a127917b-96fe-4100-b73d-5f876ce9ffc1/mesos
> 2019-09-09 05:04:26: I0909 05:04:26.383199 89950 linux_launcher.cpp:343] 
> Recovered container 
> a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97
> 2019-09-09 05:04:26: I0909 05:04:26.383216 89950 linux_launcher.cpp:331] Not 
> recovering cgroup 
> mesos/a127917b-96fe-4100-b73d-5f876ce9ffc1/mesos/9783e2bb-7c2e-4930-9d39-4225bb6f1b97/mesos
> 2019-09-09 05:04:26: I0909 05:04:26.383229 89950 linux_launcher.cpp:343] 
> Recovered container 2ee154e2-3cc4-420a-99fb-065e740f3091
> 2019-09-09 05:04:26: I0909 05:04:26.383237 89950 linux_launcher.cpp:343] 
> Recovered container a127917b-96fe-4100-b73d-5f876ce9ffc1
> 2019-09-09 05:04:26: I0909 05:04:26.383249 89950 linux_launcher.cpp:343] 
> Recovered container 
> 2ee154e2-3cc4-420a-99fb-065e740f3091.49fe2bf9-17af-415f-92b6-92a4db619436
> 2019-09-09 05:04:26: I0909 05:04:26.383260 89950 linux_launcher.cpp:331] Not 
> recovering cgroup mesos/2ee154e2-3cc4-420a-99fb-065e740f3091/mesos
> 2019-09-09 05:04:26: I0909 05:04:26.383271 89950 linux_launcher.cpp:331] Not 
> recovering cgroup 
> mesos/2ee154e2-3cc4-420a-99fb-065e740f3091/mesos/49fe2bf9-17af-415f-92b6-92a4db619436/mesos
> 2019-09-09 05:04:26: I0909 05:04:26.383280 89950 linux_launcher.cpp:437] 
> 2ee154e2-3cc4-420a-99fb-065e740f3091.49fe2bf9-17af-415f-92b6-92a4db619436 is 
> a known orphaned container
> 2019-09-09 05:04:26: I0909 05:04:26.383289 89950 linux_launcher.cpp:437] 
> a127917b-96fe-4100-b73d-5f876ce9ffc1 is a known orphaned container
> 2019-09-09 05:04:26: I0909 05:04:26.383296 89950 linux_launcher.cpp:437] 
> 2ee154e2-3cc4-420a-99fb-065e740f3091 is a known orphaned container
> 2019-09-09 05:04:26: I0909 05:04:26.383304 89950 linux_launcher.cpp:437] 
> a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97 is 
> a known orphaned container
> 2019-09-09 05:04:26: I0909 05:04:26.383414 89950 containerizer.cpp:1092] 
> Recovering isolators
> 2019-09-09 05:04:26: I0909 05:04:26.385931 89977 memory.cpp:478] Started 
> listening for OOM events for container a127917b-96fe-4100-b73d-5f876ce9ffc1
> 2019-09-09 05:04:26: I0909 05:04:26.386118 89977 memory.cpp:590] Started 
> listening on 'low' memory pressure events for container 
> a127917b-96fe-4100-b73d-5f876ce9ffc1
> 2019-09-09 05:04:26: I0909 05:04:26.386152 89977 memory.cpp:590] Started 
> listening on 'medium' memory pressure events for container 
> a127917b-96fe-4100-b73d-5f876ce9ffc1
> 2019-09-09 05:04:26: I0909 05:04:26.386175 89977 memory.cpp:590] Started 
> listening on 'critical' memory pressure events for container 
> a127917b-96fe-4100-b73d-5f876ce9ffc1
> 2019-09-09 05:04:26: I0909 05:04:26.386227 89977 memory.cpp:478] Started 
> listening for OOM events for container 2ee154e2-3cc4-420a-99fb-065e740f3091
> 2019-09-09 05:04:26: I0909 05:04:26.386248 89977 memory.cpp:590] Started 
> listening on 'low' memory pressure events for container 
> 2ee154e2-3cc4-420a-99fb-065e740f3091
> 2019-09-09 05:04:26: I0909 05:04:26.386270 89977 memory.cpp:590] Started 
> listening on 'medium' memory pressure events for container 
> 2ee154e2-3cc4-420a-99fb-065e740f3091
> 2019-09-09 05:04:26: I0909 05:04:26.386376 89977 memory.cpp:590] Started 
> listening on 'critical' memory pressure events for container 
> 2ee154e2-3cc4-420a-99fb-065e740f3091
> 2019-09-09 05:04:26: I0909 05:04:26.386694 89921 containerizer.cpp:1131] 
> Recovering provisioner
> 2019-09-09 05:04:26: I0909 05:04:26.388226 90010 

[jira] [Commented] (MESOS-9966) Agent crashes when trying to destroy orphaned nested container if root container is orphaned as well

2019-09-17 Thread Jan Schlicht (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931187#comment-16931187
 ] 

Jan Schlicht commented on MESOS-9966:
-

The flag wasn't set, so it's at its default value, which is {{false}}.

> Agent crashes when trying to destroy orphaned nested container if root 
> container is orphaned as well
> 
>
> Key: MESOS-9966
> URL: https://issues.apache.org/jira/browse/MESOS-9966
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.7.3
>Reporter: Jan Schlicht
>Assignee: Qian Zhang
>Priority: Major
>
> Noticed an agent crash-looping when trying to recover. It recognized a 
> container and its nested container as orphaned. When trying to destroy the 
> nested container, the agent crashes. Probably when trying to [get the sandbox 
> path of the root 
> container|https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2966].
> {noformat}
> 2019-09-09 05:04:26: I0909 05:04:26.382326 89950 linux_launcher.cpp:286] 
> Recovering Linux launcher
> 2019-09-09 05:04:26: I0909 05:04:26.383162 89950 linux_launcher.cpp:331] Not 
> recovering cgroup mesos/a127917b-96fe-4100-b73d-5f876ce9ffc1/mesos
> 2019-09-09 05:04:26: I0909 05:04:26.383199 89950 linux_launcher.cpp:343] 
> Recovered container 
> a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97
> 2019-09-09 05:04:26: I0909 05:04:26.383216 89950 linux_launcher.cpp:331] Not 
> recovering cgroup 
> mesos/a127917b-96fe-4100-b73d-5f876ce9ffc1/mesos/9783e2bb-7c2e-4930-9d39-4225bb6f1b97/mesos
> 2019-09-09 05:04:26: I0909 05:04:26.383229 89950 linux_launcher.cpp:343] 
> Recovered container 2ee154e2-3cc4-420a-99fb-065e740f3091
> 2019-09-09 05:04:26: I0909 05:04:26.383237 89950 linux_launcher.cpp:343] 
> Recovered container a127917b-96fe-4100-b73d-5f876ce9ffc1
> 2019-09-09 05:04:26: I0909 05:04:26.383249 89950 linux_launcher.cpp:343] 
> Recovered container 
> 2ee154e2-3cc4-420a-99fb-065e740f3091.49fe2bf9-17af-415f-92b6-92a4db619436
> 2019-09-09 05:04:26: I0909 05:04:26.383260 89950 linux_launcher.cpp:331] Not 
> recovering cgroup mesos/2ee154e2-3cc4-420a-99fb-065e740f3091/mesos
> 2019-09-09 05:04:26: I0909 05:04:26.383271 89950 linux_launcher.cpp:331] Not 
> recovering cgroup 
> mesos/2ee154e2-3cc4-420a-99fb-065e740f3091/mesos/49fe2bf9-17af-415f-92b6-92a4db619436/mesos
> 2019-09-09 05:04:26: I0909 05:04:26.383280 89950 linux_launcher.cpp:437] 
> 2ee154e2-3cc4-420a-99fb-065e740f3091.49fe2bf9-17af-415f-92b6-92a4db619436 is 
> a known orphaned container
> 2019-09-09 05:04:26: I0909 05:04:26.383289 89950 linux_launcher.cpp:437] 
> a127917b-96fe-4100-b73d-5f876ce9ffc1 is a known orphaned container
> 2019-09-09 05:04:26: I0909 05:04:26.383296 89950 linux_launcher.cpp:437] 
> 2ee154e2-3cc4-420a-99fb-065e740f3091 is a known orphaned container
> 2019-09-09 05:04:26: I0909 05:04:26.383304 89950 linux_launcher.cpp:437] 
> a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97 is 
> a known orphaned container
> 2019-09-09 05:04:26: I0909 05:04:26.383414 89950 containerizer.cpp:1092] 
> Recovering isolators
> 2019-09-09 05:04:26: I0909 05:04:26.385931 89977 memory.cpp:478] Started 
> listening for OOM events for container a127917b-96fe-4100-b73d-5f876ce9ffc1
> 2019-09-09 05:04:26: I0909 05:04:26.386118 89977 memory.cpp:590] Started 
> listening on 'low' memory pressure events for container 
> a127917b-96fe-4100-b73d-5f876ce9ffc1
> 2019-09-09 05:04:26: I0909 05:04:26.386152 89977 memory.cpp:590] Started 
> listening on 'medium' memory pressure events for container 
> a127917b-96fe-4100-b73d-5f876ce9ffc1
> 2019-09-09 05:04:26: I0909 05:04:26.386175 89977 memory.cpp:590] Started 
> listening on 'critical' memory pressure events for container 
> a127917b-96fe-4100-b73d-5f876ce9ffc1
> 2019-09-09 05:04:26: I0909 05:04:26.386227 89977 memory.cpp:478] Started 
> listening for OOM events for container 2ee154e2-3cc4-420a-99fb-065e740f3091
> 2019-09-09 05:04:26: I0909 05:04:26.386248 89977 memory.cpp:590] Started 
> listening on 'low' memory pressure events for container 
> 2ee154e2-3cc4-420a-99fb-065e740f3091
> 2019-09-09 05:04:26: I0909 05:04:26.386270 89977 memory.cpp:590] Started 
> listening on 'medium' memory pressure events for container 
> 2ee154e2-3cc4-420a-99fb-065e740f3091
> 2019-09-09 05:04:26: I0909 05:04:26.386376 89977 memory.cpp:590] Started 
> listening on 'critical' memory pressure events for container 
> 2ee154e2-3cc4-420a-99fb-065e740f3091
> 2019-09-09 05:04:26: I0909 05:04:26.386694 89921 containerizer.cpp:1131] 
> Recovering provisioner
> 2019-09-09 05:04:26: I0909 05:04:26.388226 90010 metadata_manager.cpp:286] 
> Successfully loaded 64 Docker images
> 2019-09-09 

[jira] [Created] (MESOS-9968) WWWAuthenticate header parsing fails when commas are in (quoted) realm

2019-09-17 Thread Jan Schlicht (Jira)
Jan Schlicht created MESOS-9968:
---

 Summary: WWWAuthenticate header parsing fails when commas are in 
(quoted) realm
 Key: MESOS-9968
 URL: https://issues.apache.org/jira/browse/MESOS-9968
 Project: Mesos
  Issue Type: Bug
  Components: HTTP API, libprocess
Reporter: Jan Schlicht


This was discovered when trying to launch the 
{{[nvcr.io/nvidia/tensorflow:19.08-py3|http://nvcr.io/nvidia/tensorflow:19.08-py3]}}
 image using the Mesos containerizer. This launch fails with
{noformat}
Failed to launch container: Failed to get WWW-Authenticate header: Unexpected 
auth-param format: 
'realm="https://nvcr.io/proxy_auth?scope=repository:nvidia/tensorflow:pull' in 
'realm="https://nvcr.io/proxy_auth?scope=repository:nvidia/tensorflow:pull,push;'
{noformat}
This is because the [header tokenization in 
libprocess|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/http.cpp#L640]
 can't handle commas in quoted realm values.
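
The error shows the tokenizer cutting the {{realm}} value at the comma inside the quoted string. A self-contained sketch of the needed behavior, splitting auth-params on commas only while outside a quoted string; this is an illustration, not the actual libprocess parser, and escaped quotes are ignored for brevity:
{code}
#include <iostream>
#include <string>
#include <vector>

// Split `auth-param` pairs on commas while honoring double-quoted values,
// so that realm="https://...:pull,push" stays in one piece.
std::vector<std::string> splitAuthParams(const std::string& input)
{
  std::vector<std::string> params;
  std::string current;
  bool quoted = false;

  for (char c : input) {
    if (c == '"') {
      quoted = !quoted;              // toggle quoted state
      current += c;
    } else if (c == ',' && !quoted) {
      if (!current.empty()) {
        params.push_back(current);   // a comma outside quotes ends a param
        current.clear();
      }
    } else {
      current += c;
    }
  }

  if (!current.empty()) {
    params.push_back(current);
  }

  return params;
}

int main()
{
  // Realm taken from the error above; the second parameter is illustrative.
  const std::string header =
    "realm=\"https://nvcr.io/proxy_auth?scope=repository:"
    "nvidia/tensorflow:pull,push\",service=\"registry\"";

  for (const std::string& param : splitAuthParams(header)) {
    std::cout << param << std::endl;
  }

  return 0;
}
{code}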



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (MESOS-9966) Agent crashes when trying to destroy orphaned nested container if root container is orphaned as well

2019-09-13 Thread Jan Schlicht (Jira)
Jan Schlicht created MESOS-9966:
---

 Summary: Agent crashes when trying to destroy orphaned nested 
container if root container is orphaned as well
 Key: MESOS-9966
 URL: https://issues.apache.org/jira/browse/MESOS-9966
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Affects Versions: 1.7.3
Reporter: Jan Schlicht


Noticed an agent crash-looping when trying to recover. It recognized a 
container and its nested container as orphaned. When trying to destroy the 
nested container, the agent crashes. Probably when trying to [get the sandbox 
path of the root 
container|https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2966].

{noformat}
2019-09-09 05:04:26: I0909 05:04:26.382326 89950 linux_launcher.cpp:286] 
Recovering Linux launcher
2019-09-09 05:04:26: I0909 05:04:26.383162 89950 linux_launcher.cpp:331] Not 
recovering cgroup mesos/a127917b-96fe-4100-b73d-5f876ce9ffc1/mesos
2019-09-09 05:04:26: I0909 05:04:26.383199 89950 linux_launcher.cpp:343] 
Recovered container 
a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97
2019-09-09 05:04:26: I0909 05:04:26.383216 89950 linux_launcher.cpp:331] Not 
recovering cgroup 
mesos/a127917b-96fe-4100-b73d-5f876ce9ffc1/mesos/9783e2bb-7c2e-4930-9d39-4225bb6f1b97/mesos
2019-09-09 05:04:26: I0909 05:04:26.383229 89950 linux_launcher.cpp:343] 
Recovered container 2ee154e2-3cc4-420a-99fb-065e740f3091
2019-09-09 05:04:26: I0909 05:04:26.383237 89950 linux_launcher.cpp:343] 
Recovered container a127917b-96fe-4100-b73d-5f876ce9ffc1
2019-09-09 05:04:26: I0909 05:04:26.383249 89950 linux_launcher.cpp:343] 
Recovered container 
2ee154e2-3cc4-420a-99fb-065e740f3091.49fe2bf9-17af-415f-92b6-92a4db619436
2019-09-09 05:04:26: I0909 05:04:26.383260 89950 linux_launcher.cpp:331] Not 
recovering cgroup mesos/2ee154e2-3cc4-420a-99fb-065e740f3091/mesos
2019-09-09 05:04:26: I0909 05:04:26.383271 89950 linux_launcher.cpp:331] Not 
recovering cgroup 
mesos/2ee154e2-3cc4-420a-99fb-065e740f3091/mesos/49fe2bf9-17af-415f-92b6-92a4db619436/mesos
2019-09-09 05:04:26: I0909 05:04:26.383280 89950 linux_launcher.cpp:437] 
2ee154e2-3cc4-420a-99fb-065e740f3091.49fe2bf9-17af-415f-92b6-92a4db619436 is a 
known orphaned container
2019-09-09 05:04:26: I0909 05:04:26.383289 89950 linux_launcher.cpp:437] 
a127917b-96fe-4100-b73d-5f876ce9ffc1 is a known orphaned container
2019-09-09 05:04:26: I0909 05:04:26.383296 89950 linux_launcher.cpp:437] 
2ee154e2-3cc4-420a-99fb-065e740f3091 is a known orphaned container
2019-09-09 05:04:26: I0909 05:04:26.383304 89950 linux_launcher.cpp:437] 
a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97 is a 
known orphaned container
2019-09-09 05:04:26: I0909 05:04:26.383414 89950 containerizer.cpp:1092] 
Recovering isolators
2019-09-09 05:04:26: I0909 05:04:26.385931 89977 memory.cpp:478] Started 
listening for OOM events for container a127917b-96fe-4100-b73d-5f876ce9ffc1
2019-09-09 05:04:26: I0909 05:04:26.386118 89977 memory.cpp:590] Started 
listening on 'low' memory pressure events for container 
a127917b-96fe-4100-b73d-5f876ce9ffc1
2019-09-09 05:04:26: I0909 05:04:26.386152 89977 memory.cpp:590] Started 
listening on 'medium' memory pressure events for container 
a127917b-96fe-4100-b73d-5f876ce9ffc1
2019-09-09 05:04:26: I0909 05:04:26.386175 89977 memory.cpp:590] Started 
listening on 'critical' memory pressure events for container 
a127917b-96fe-4100-b73d-5f876ce9ffc1
2019-09-09 05:04:26: I0909 05:04:26.386227 89977 memory.cpp:478] Started 
listening for OOM events for container 2ee154e2-3cc4-420a-99fb-065e740f3091
2019-09-09 05:04:26: I0909 05:04:26.386248 89977 memory.cpp:590] Started 
listening on 'low' memory pressure events for container 
2ee154e2-3cc4-420a-99fb-065e740f3091
2019-09-09 05:04:26: I0909 05:04:26.386270 89977 memory.cpp:590] Started 
listening on 'medium' memory pressure events for container 
2ee154e2-3cc4-420a-99fb-065e740f3091
2019-09-09 05:04:26: I0909 05:04:26.386376 89977 memory.cpp:590] Started 
listening on 'critical' memory pressure events for container 
2ee154e2-3cc4-420a-99fb-065e740f3091
2019-09-09 05:04:26: I0909 05:04:26.386694 89921 containerizer.cpp:1131] 
Recovering provisioner
2019-09-09 05:04:26: I0909 05:04:26.388226 90010 metadata_manager.cpp:286] 
Successfully loaded 64 Docker images
2019-09-09 05:04:26: I0909 05:04:26.388420 89932 provisioner.cpp:494] 
Provisioner recovery complete
2019-09-09 05:04:26: I0909 05:04:26.388530 90003 containerizer.cpp:1203] 
Cleaning up orphan container 
a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97
2019-09-09 05:04:26: I0909 05:04:26.388562 90003 containerizer.cpp:2520] 
Destroying container 
a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97 in 
RUNNING state
2019-09-09 05:04:26: I0909 05:04:26.388576 90003 

[jira] [Created] (MESOS-9885) Resource provider configuration is only removed after its container has been stopped, causing issues in failover scenarios

2019-07-10 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-9885:
---

 Summary: Resource provider configuration is only removed after its 
container has been stopped, causing issues in failover scenarios
 Key: MESOS-9885
 URL: https://issues.apache.org/jira/browse/MESOS-9885
 Project: Mesos
  Issue Type: Bug
  Components: resource provider
Affects Versions: 1.8.0
Reporter: Jan Schlicht


An agent could crash while it is handling a {{REMOVE_RESOURCE_PROVIDER_CONFIG}} 
call. In that case, the resource provider won't be removed. This is because its 
configuration is only removed if the actual resource provider container has 
been stopped. I.e. in {{LocalResourceProviderDaemonProcess::remove}} {{os::rm}} 
is only called if {{cleanupContainers}} was successful. After agent failover, 
the resource provider will still be running. This can be a problem for 
frameworks/operators, because there isn't a feedback channel that informs them 
if their removal request was successful or not.
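
A purely illustrative sketch of the ordering described above; {{cleanupContainers}} and the paths below are stand-ins, not the actual {{LocalResourceProviderDaemonProcess}} code. The configuration file is deleted only after the container cleanup has succeeded, so a crash anywhere before that point leaves the configuration, and therefore the provider, behind after failover:
{code}
#include <cstdio>
#include <iostream>
#include <string>

// Stand-in for stopping the resource provider's container; in the real code
// this is asynchronous and can be interrupted by an agent crash.
bool cleanupContainers(const std::string& provider)
{
  std::cout << "Stopping container of " << provider << std::endl;
  return true;
}

bool removeProvider(const std::string& configPath, const std::string& provider)
{
  // The config file is only deleted after the container cleanup succeeded.
  // If the agent crashes anywhere before the std::remove() call below, the
  // config survives the failover and the provider is started again.
  if (!cleanupContainers(provider)) {
    return false;
  }

  return std::remove(configPath.c_str()) == 0;
}

int main()
{
  // Illustrative path and name only.
  removeProvider("/etc/mesos/resource-providers/example.json", "example-provider");
  return 0;
}
{code}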



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9743) Argument forwarding in CMake build results in glog 0.4.0 built as shared library

2019-04-25 Thread Jan Schlicht (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825847#comment-16825847
 ] 

Jan Schlicht commented on MESOS-9743:
-

cc [~asekretenko]
It looks like this change was intended: in https://reviews.apache.org/r/70387/ 
the imported location is changed from {{glog}} to {{libglog}}, i.e. from a 
static to a dynamic library. In that case, the issue is probably related to 
the Ninja build system not copying a byproduct. But then, building with 
{{BUILD_SHARED_LIBS=ON}} will cause problems, because GLog would be built as a 
static lib while we now expect a dynamic library.

> Argument forwarding in CMake build results in glog 0.4.0 built as shared library
> --
>
> Key: MESOS-9743
> URL: https://issues.apache.org/jira/browse/MESOS-9743
> Project: Mesos
>  Issue Type: Bug
>  Components: cmake
>Affects Versions: 1.8.0
> Environment: macOS 10.14.4, clang 8.0.0, Ninja build system
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>Priority: Major
>  Labels: build, easyfix, mesosphere, triaged
>
> GLog versions >= 0.3.5 introduce a {{BUILD_SHARED_LIBS}} CMake option. The 
> CMake configuration of Mesos also has such an option. Because these options 
> are forwarded to third-party packages, GLog will be built as a shared library 
> if Mesos is built with {{BUILD_SHARED_LIBS=OFF}}. This is not intended, as in 
> that case the GLog shared library is not copied over, resulting in Mesos 
> binaries failing to start.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9743) Argument forwarding in CMake build results in glog 0.4.0 built as shared library

2019-04-25 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-9743:
---

 Summary: Argument forwarding in CMake build results in glog 0.4.0 
built as shared library
 Key: MESOS-9743
 URL: https://issues.apache.org/jira/browse/MESOS-9743
 Project: Mesos
  Issue Type: Bug
  Components: cmake
Affects Versions: 1.8.0
 Environment: macOS 10.14.4, clang 8.0.0
Reporter: Jan Schlicht
Assignee: Jan Schlicht


GLog versions >= 0.3.5 introduce a {{BUILD_SHARED_LIBS}} CMake option. The 
CMake configuration of Mesos also has such an option. Because these options are 
forwarded to third-party packages, GLog will be built as a shared library if 
Mesos is built with {{BUILD_SHARED_LIBS=OFF}}. This is not intended, as in that 
case the GLog shared library is not copied over, resulting in Mesos binaries 
failing to start.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9594) Test `StorageLocalResourceProviderTest.RetryRpcWithExponentialBackoff` is flaky.

2019-04-11 Thread Jan Schlicht (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815177#comment-16815177
 ] 

Jan Schlicht commented on MESOS-9594:
-

While trying to reproduce this locally by running
{noformat}
stress-ng --cpu=100 --io 20 --vm 20 --fork 100 --timeout 3600s &
GLOG_v=1 src/mesos-tests --verbose 
--gtest_filter=*RetryRpcWithExponentialBackoff --gtest_repeat=-1 
--gtest_break_on_failure
{noformat}
the test crashes in a manner similar to the one reported in MESOS-9712. Log: 
[^RetryRpcWithExponentialBackoff-segfault.txt] 

> Test `StorageLocalResourceProviderTest.RetryRpcWithExponentialBackoff` is 
> flaky.
> 
>
> Key: MESOS-9594
> URL: https://issues.apache.org/jira/browse/MESOS-9594
> Project: Mesos
>  Issue Type: Bug
>  Components: storage, test
>Reporter: Chun-Hung Hsiao
>Assignee: Jan Schlicht
>Priority: Major
>  Labels: flaky-test, mesosphere, storage
> Attachments: RetryRpcWithExponentialBackoff-badrun.txt, 
> RetryRpcWithExponentialBackoff-segfault.txt
>
>
> Observed on ASF CI:
> {noformat}
> /tmp/SRC/src/tests/storage_local_resource_provider_tests.cpp:5027
> Failed to wait 1mins for offers
> {noformat}
> Full log:  [^RetryRpcWithExponentialBackoff-badrun.txt] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9712) StorageLocalResourceProviderTest.CsiPluginRpcMetrics is flaky

2019-04-09 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-9712:
---

 Summary: StorageLocalResourceProviderTest.CsiPluginRpcMetrics is 
flaky
 Key: MESOS-9712
 URL: https://issues.apache.org/jira/browse/MESOS-9712
 Project: Mesos
  Issue Type: Bug
  Components: storage
 Environment: Debian 9, Mesos configured with SSL support
Reporter: Jan Schlicht


From an internal CI run:
{noformat}
[ RUN  ] StorageLocalResourceProviderTest.CsiPluginRpcMetrics
06:56:26 I0409 06:56:26.350445 23181 cluster.cpp:176] Creating default 'local' 
authorizer
06:56:26 malloc_consolidate(): invalid chunk size
06:56:26 *** Aborted at 1554792986 (unix time) try "date -d @1554792986" if you 
are using GNU date ***
06:56:26 PC: @ 0x7f1cf4481f3b (unknown)
06:56:26 *** SIGABRT (@0x5a8d) received by PID 23181 (TID 0x7f1ce9be8700) from 
PID 23181; stack trace: ***
06:56:26 @ 0x7f1cf461b8e0 __GI___pthread_rwlock_rdlock
06:56:26 @ 0x7f1cf4481f3b (unknown)
06:56:26 @ 0x7f1cf44832f1 (unknown)
06:56:26 @ 0x7f1cf44c4867 (unknown)
06:56:26 @ 0x7f1cf44cae0a (unknown)
06:56:26 @ 0x7f1cf44cb10e (unknown)
06:56:26 @ 0x7f1cf44cddad (unknown)
06:56:26 @ 0x7f1cf44cf7dd (unknown)
06:56:26 @ 0x7f1cf4a647a8 (unknown)
06:56:26 @ 0x7f1cf88d0805 google::LogMessage::Init()
06:56:26 @ 0x7f1cf88d10ac google::LogMessage::LogMessage()
06:56:26 @ 0x7f1cf752a46a mesos::internal::master::Master::initialize()
06:56:26 @ 0x7f1cf882bd72 process::ProcessManager::resume()
06:56:26 @ 0x7f1cf88303c6 
_ZNSt6thread11_State_implISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
06:56:26 @ 0x7f1cf4a8ee6f (unknown)
06:56:26 @ 0x7f1cf4610f2a (unknown)
06:56:26 @ 0x7f1cf4543edf (unknown)
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9594) Test `StorageLocalResourceProviderTest.RetryRpcWithExponentialBackoff` is flaky.

2019-03-11 Thread Jan Schlicht (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht reassigned MESOS-9594:
---

Assignee: Jan Schlicht

> Test `StorageLocalResourceProviderTest.RetryRpcWithExponentialBackoff` is 
> flaky.
> 
>
> Key: MESOS-9594
> URL: https://issues.apache.org/jira/browse/MESOS-9594
> Project: Mesos
>  Issue Type: Bug
>  Components: storage, test
>Reporter: Chun-Hung Hsiao
>Assignee: Jan Schlicht
>Priority: Major
>  Labels: flaky-test, mesosphere, storage
> Attachments: RetryRpcWithExponentialBackoff-badrun.txt
>
>
> Observed on ASF CI:
> {noformat}
> /tmp/SRC/src/tests/storage_local_resource_provider_tests.cpp:5027
> Failed to wait 1mins for offers
> {noformat}
> Full log:  [^RetryRpcWithExponentialBackoff-badrun.txt] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9612) Resource provider manager assumes all operations are triggered by frameworks

2019-03-04 Thread Jan Schlicht (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht reassigned MESOS-9612:
---

Assignee: Jan Schlicht

> Resource provider manager assumes all operations are triggered by frameworks
> 
>
> Key: MESOS-9612
> URL: https://issues.apache.org/jira/browse/MESOS-9612
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Bannier
>Assignee: Jan Schlicht
>Priority: Blocker
>  Labels: mesosphere, mesosphere-dss-ga, storage
>
> When the agent tries to apply an operation to resource provider resources, it 
> invokes {{ResourceProviderManager::applyOperation}} which in turn invokes 
> {{ResourceProviderManagerProcess::applyOperation}}. That function currently 
> assumes that the received message contains a valid {{FrameworkID}},
> {noformat}
> void ResourceProviderManagerProcess::applyOperation(
>     const ApplyOperationMessage& message)
> {
>   const Offer::Operation& operation = message.operation_info();
>
>   // `framework_id` is `optional`.
>   const FrameworkID& frameworkId = message.framework_id();
> {noformat}
> Since {{FrameworkID}} is not a trivial proto type, but one with a {{required}} 
> field {{value}}, a message composed with this {{frameworkId}} cannot be 
> serialized. This leads to a failure that in turn triggers a {{CHECK}} failure 
> in the agent's function interfacing with the manager.
> A typical scenario where we would want to support operator API calls here is 
> to destroy leftover persistent volumes or reservations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9631) MasterLoadTest.SimultaneousBatchedRequests segfaults on macOS

2019-03-04 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-9631:
---

 Summary: MasterLoadTest.SimultaneousBatchedRequests segfaults on 
macOS
 Key: MESOS-9631
 URL: https://issues.apache.org/jira/browse/MESOS-9631
 Project: Mesos
  Issue Type: Bug
  Components: test
 Environment: macOS Mojave 10.14.3
Reporter: Jan Schlicht


Also tested on Linux, where this test succeeds. {{GLOG_v=1}} output of this 
test on macOS:
{noformat}
I0304 09:33:08.532002 155725824 master.cpp:414] Master 
8be09e79-ff3b-49bf-86e9-cde00fbdcdaa (172.18.8.49) started on 172.18.8.49:56584
I0304 09:33:08.532045 155725824 master.cpp:417] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/private/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/uCWwLH/credentials"
 --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_operator_event_stream_subscribers="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
--publish_per_framework_metrics="true" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/private/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/uCWwLH/master"
 --zk_session_timeout="10secs"
I0304 09:33:08.532878 155725824 master.cpp:466] Master only allowing 
authenticated frameworks to register
I0304 09:33:08.532889 155725824 master.cpp:472] Master only allowing 
authenticated agents to register
I0304 09:33:08.532896 155725824 master.cpp:478] Master only allowing 
authenticated HTTP frameworks to register
I0304 09:33:08.532903 155725824 credentials.hpp:37] Loading credentials for 
authentication from 
'/private/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/uCWwLH/credentials'
I0304 09:33:08.533071 155725824 master.cpp:522] Using default 'crammd5' 
authenticator
I0304 09:33:08.533094 155725824 authenticator.cpp:520] Initializing server SASL
I0304 09:33:08.551656 155725824 auxprop.cpp:73] Initialized in-memory auxiliary 
property plugin
I0304 09:33:08.551702 155725824 http.cpp:965] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0304 09:33:08.551745 155725824 http.cpp:965] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0304 09:33:08.551766 155725824 http.cpp:965] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0304 09:33:08.551785 155725824 master.cpp:603] Authorization enabled
I0304 09:33:08.551923 154116096 whitelist_watcher.cpp:77] No whitelist given
I0304 09:33:08.551964 151969792 hierarchical.cpp:208] Initialized hierarchical 
allocator process
I0304 09:33:08.553930 151969792 master.cpp:2103] Elected as the leading master!
I0304 09:33:08.553966 151969792 master.cpp:1638] Recovering from registrar
I0304 09:33:08.554018 153579520 registrar.cpp:339] Recovering registrar
I0304 09:33:08.556378 155725824 registrar.cpp:383] Successfully fetched the 
registry (0B) in 2.342912ms
I0304 09:33:08.556512 155725824 registrar.cpp:487] Applied 1 operations in 
38854ns; attempting to update the registry
I0304 09:33:08.558737 153579520 registrar.cpp:544] Successfully updated the 
registry in 2.206976ms
I0304 09:33:08.558776 153579520 registrar.cpp:416] Successfully recovered 
registrar
I0304 09:33:08.55 153042944 master.cpp:1752] Recovered 0 agents from the 
registry (136B); allowing 10mins for agents to reregister
I0304 09:33:08.558929 155725824 hierarchical.cpp:248] Skipping recovery of 
hierarchical allocator: nothing to recover
I0304 09:33:08.561846 162198976 sched.cpp:232] Version: 1.8.0
I0304 09:33:08.562060 155189248 sched.cpp:336] New master detected at 
master@172.18.8.49:56584
I0304 09:33:08.562099 155189248 sched.cpp:401] Authenticating with master 
master@172.18.8.49:56584
I0304 09:33:08.562110 155189248 sched.cpp:408] Using default CRAM-MD5 
authenticatee
I0304 09:33:08.562196 

[jira] [Assigned] (MESOS-9521) MasterAPITest.OperationUpdatesUponAgentGone is flaky

2019-01-15 Thread Jan Schlicht (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht reassigned MESOS-9521:
---

Assignee: Benno Evers

> MasterAPITest.OperationUpdatesUponAgentGone is flaky
> 
>
> Key: MESOS-9521
> URL: https://issues.apache.org/jira/browse/MESOS-9521
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.8.0
> Environment: Fedora28, cmake w/ SSL
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Major
>  Labels: flaky, flaky-test
>
> The recently added test {{MasterAPITest.OperationUpdatesUponAgentGone}} is 
> flaky, e.g.,
> {noformat}
> ../src/tests/api_tests.cpp:5051: Failure
> Value of: resources.empty()
>   Actual: true
> Expected: false
> ../3rdparty/libprocess/src/../include/process/gmock.hpp:504: Failure
> Actual function call count doesn't match EXPECT_CALL(filter->mock, filter(to, 
> testing::A()))...
> Expected args: message matcher (32-byte object  24-00 00-00 00-00 00-00 24-00 00-00 00-00 00-00 41-63 74-75 61-6C 20-66>, 
> 1-byte object )
>  Expected: to be called once
>Actual: never called - unsatisfied and active
> {noformat}
> I am able to reproduce this reliably in less than 10 iterations when running 
> the test in repetition under additional system stress.
> Even if the test does not fail it produces the following gmock warning,
> {noformat}
> GMOCK WARNING:
> Uninteresting mock function call - returning directly.
> Function call: disconnected()
> {noformat}
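
For context on the warning: gmock flags calls that have no matching expectation. A small, self-contained example with a hypothetical {{Scheduler}} mock (not the actual Mesos {{MockScheduler}}) showing how such a call can be explicitly allowed, which removes the warning without otherwise changing a test:
{code}
#include <gmock/gmock.h>
#include <gtest/gtest.h>

// Hypothetical interface and mock, for illustration only.
class Scheduler
{
public:
  virtual ~Scheduler() {}
  virtual void disconnected() = 0;
};

class MockScheduler : public Scheduler
{
public:
  MOCK_METHOD0(disconnected, void());
};

TEST(ExampleTest, QuietDisconnected)
{
  MockScheduler sched;

  // Declaring the call as allowed (any number of times) keeps gmock from
  // reporting it as an uninteresting call.
  EXPECT_CALL(sched, disconnected())
    .Times(testing::AnyNumber());

  sched.disconnected();
}
{code}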



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9520) IOTest.Read hangs on Windows

2019-01-14 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-9520:
---

 Summary: IOTest.Read hangs on Windows
 Key: MESOS-9520
 URL: https://issues.apache.org/jira/browse/MESOS-9520
 Project: Mesos
  Issue Type: Bug
  Components: test
 Environment: Windows
Reporter: Jan Schlicht


Noticed in test runs that {{IOTest.Read}} hangs in Windows environments. Test 
runs need to be aborted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9480) Master may skip processing authorization results for `LAUNCH_GROUP`.

2018-12-17 Thread Jan Schlicht (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht reassigned MESOS-9480:
---

Assignee: Chun-Hung Hsiao  (was: Jan Schlicht)

> Master may skip processing authorization results for `LAUNCH_GROUP`.
> 
>
> Key: MESOS-9480
> URL: https://issues.apache.org/jira/browse/MESOS-9480
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.5.0, 1.5.1, 1.6.0, 1.6.1, 1.7.0
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Blocker
>  Labels: mesosphere
>
> If there is a validation error for {{LAUNCH_GROUP}}, or if there are multiple 
> authorization errors for some of the tasks in a {{LAUNCH_GROUP}}, the master 
> will skip processing the remaining authorization results, which would result 
> in these authorization results being examined by subsequent operations 
> incorrectly:
> https://github.com/apache/mesos/blob/3ade731d0c1772206c4afdf56318cfab6356acee/src/master/master.cpp#L5487-L5521
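
A self-contained illustration of the bookkeeping problem described above (plain C++, not the master's code): per-task authorization results for a {{LAUNCH_GROUP}} sit in one shared, ordered collection, so bailing out early leaves results behind that the next operation then consumes as if they were its own:
{code}
#include <iostream>
#include <queue>
#include <string>
#include <vector>

int main()
{
  // One shared queue of authorization results, in request order.
  std::queue<bool> authorizations;

  // A LAUNCH_GROUP with three tasks, the second of which was denied.
  std::vector<std::string> groupTasks = {"task-1", "task-2", "task-3"};
  authorizations.push(true);
  authorizations.push(false);
  authorizations.push(true);

  // A later, unrelated operation also queued one result.
  authorizations.push(true);

  for (const std::string& task : groupTasks) {
    bool authorized = authorizations.front();
    authorizations.pop();

    if (!authorized) {
      std::cout << "Dropping task group: " << task
                << " was not authorized" << std::endl;
      // BUG: task-3's result stays in the queue and will be consumed by the
      // next operation; the loop has to drain all results for the group.
      break;
    }
  }

  std::cout << "Results left for the next operation: "
            << authorizations.size() << " (expected 1)" << std::endl;

  return 0;
}
{code}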



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-15 Thread Jan Schlicht (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579747#comment-16579747
 ] 

Jan Schlicht edited comment on MESOS-8568 at 8/15/18 12:19 PM:
---

-No, the {{REMOVE_NESTED_CONTAINER}} shouldn't be a problem here. This 
particular 500 return code is actually a no-op in the containerizer. We don't 
need to call {{WAIT_NESTED_CONTAINER}} here.-


was (Author: nfnt):
No, the {{REMOVE_NESTED_CONTAINER}} shouldn't be a problem here. This 
particular 500 return code is actually a no-op in the containerizer. We don't 
need to call {{WAIT_NESTED_CONTAINER}} here.

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Priority: Blocker
>  Labels: default-executor, health-check, mesosphere
>
> After successful launch of a nested container via 
> `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls 
> [waitNestedContainer 
> |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657]
>  for the container. Checker library 
> [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
>  `REMOVE_NESTED_CONTAINER` to remove a previous nested container before 
> launching a nested container for a subsequent check. Hence, 
> `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that 
> the nested container has been terminated and can be removed/cleaned up.
> In case of failure, the library [doesn't 
> call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
>  `WAIT_NESTED_CONTAINER`. Despite the failure, the container might be 
> launched and the following attempt to remove the container without call 
> `WAIT_NESTED_CONTAINER` leads to errors like:
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal 
> Server Error' (Nested container has not terminated yet) while removing the 
> nested container 
> '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
>  used for the COMMAND check for task 
> 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
> The checker library should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-15 Thread Jan Schlicht (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580929#comment-16580929
 ] 

Jan Schlicht commented on MESOS-8568:
-

Scratch my older comment. {{REMOVE_NESTED_CONTAINER}} has to be called on a 
destroyed container, because as part of this call, the container's runtime 
directory will be removed. I.e., if this call isn't successful, it will leak 
the container's runtime directory. This is the case in the scenario above. 
Hence, the checker has to call {{WAIT_NESTED_CONTAINER}} to make sure that it's 
not calling {{REMOVE_NESTED_CONTAINER}} on a container that is currently being 
destroyed.
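
A simplified sketch of the ordering the checker needs; {{waitNestedContainer()}} and {{removeNestedContainer()}} are hypothetical stand-ins for issuing the agent API calls {{WAIT_NESTED_CONTAINER}} and {{REMOVE_NESTED_CONTAINER}}, not real client code:
{code}
#include <iostream>
#include <string>

// Stand-in for WAIT_NESTED_CONTAINER; the real call returns once the
// container has terminated.
bool waitNestedContainer(const std::string& containerId)
{
  std::cout << "WAIT_NESTED_CONTAINER " << containerId << std::endl;
  return true;
}

// Stand-in for REMOVE_NESTED_CONTAINER; the real call cleans up the
// container's runtime directory.
bool removeNestedContainer(const std::string& containerId)
{
  std::cout << "REMOVE_NESTED_CONTAINER " << containerId << std::endl;
  return true;
}

void removePreviousCheckContainer(const std::string& containerId)
{
  // Always wait first, even if launching the container appeared to fail:
  // removing a container that is still being destroyed fails with
  // '500 Internal Server Error' and leaks its runtime directory.
  if (!waitNestedContainer(containerId)) {
    std::cerr << "Wait failed for " << containerId << std::endl;
    return;
  }

  removeNestedContainer(containerId);
}

int main()
{
  removePreviousCheckContainer("example-parent.check-example");
  return 0;
}
{code}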

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Priority: Blocker
>  Labels: default-executor, health-check, mesosphere
>
> After successful launch of a nested container via 
> `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls 
> [waitNestedContainer 
> |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657]
>  for the container. Checker library 
> [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
>  `REMOVE_NESTED_CONTAINER` to remove a previous nested container before 
> launching a nested container for a subsequent check. Hence, 
> `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that 
> the nested container has been terminated and can be removed/cleaned up.
> In case of failure, the library [doesn't 
> call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
>  `WAIT_NESTED_CONTAINER`. Despite the failure, the container might be 
> launched and the following attempt to remove the container without call 
> `WAIT_NESTED_CONTAINER` leads to errors like:
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal 
> Server Error' (Nested container has not terminated yet) while removing the 
> nested container 
> '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
>  used for the COMMAND check for task 
> 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
> The checker library should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9153) Failures when isolating cgroups can leak containers

2018-08-14 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-9153:
---

 Summary: Failures when isolating cgroups can leak containers
 Key: MESOS-9153
 URL: https://issues.apache.org/jira/browse/MESOS-9153
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.5.1
Reporter: Jan Schlicht
 Attachments: health_check_leak.txt

When the isolation of cgroups fails (e.g., if cgroup hierarchies changed, as 
described in [MESOS-3488|https://issues.apache.org/jira/browse/MESOS-3488]), 
this will lead to a leaked container, possibly only for nested containers. The 
attached log is a {{VLOG(2)}} log of a nested container that's started as part 
of a command health check for Kafka. I've removed all log lines unrelated to 
this container. Also, the cgroup hierarchy has been manipulated to run into 
MESOS-3488.

The Linux launcher fails while the containerizer is in {{ISOLATING}} state. The 
containerizer transitions to {{DESTROYING}} and tries to clean up the isolators. 
The isolators ignore the cleanup requests, because the container ID seems to be 
unknown to them. In the case of the Linux Filesystem Isolator, this leads to the 
container directory not getting cleaned up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-14 Thread Jan Schlicht (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579747#comment-16579747
 ] 

Jan Schlicht commented on MESOS-8568:
-

No, the {{REMOVE_NESTED_CONTAINER}} shouldn't be a problem here. This 
particular 500 return code is actually a no-op in the containerizer. We don't 
need to call {{WAIT_NESTED_CONTAINER}} here.

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Priority: Blocker
>  Labels: default-executor, health-check, mesosphere
>
> After successful launch of a nested container via 
> `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls 
> [waitNestedContainer 
> |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657]
>  for the container. Checker library 
> [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
>  `REMOVE_NESTED_CONTAINER` to remove a previous nested container before 
> launching a nested container for a subsequent check. Hence, 
> `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that 
> the nested container has been terminated and can be removed/cleaned up.
> In case of failure, the library [doesn't 
> call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
>  `WAIT_NESTED_CONTAINER`. Despite the failure, the container might be 
> launched and the following attempt to remove the container without call 
> `WAIT_NESTED_CONTAINER` leads to errors like:
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal 
> Server Error' (Nested container has not terminated yet) while removing the 
> nested container 
> '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
>  used for the COMMAND check for task 
> 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
> The checker library should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-14 Thread Jan Schlicht (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579656#comment-16579656
 ] 

Jan Schlicht commented on MESOS-8568:
-

I've linked MESOS-9131, as it's very similar: Calling 
{{REMOVE_NESTED_CONTAINER}} while that container is being destroyed seems to 
result in a race condition, though it isn't yet clear why.

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Priority: Blocker
>  Labels: default-executor, health-check, mesosphere
>
> After successful launch of a nested container via 
> `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls 
> [waitNestedContainer 
> |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657]
>  for the container. Checker library 
> [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
>  `REMOVE_NESTED_CONTAINER` to remove a previous nested container before 
> launching a nested container for a subsequent check. Hence, 
> `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that 
> the nested container has been terminated and can be removed/cleaned up.
> In case of failure, the library [doesn't 
> call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
>  `WAIT_NESTED_CONTAINER`. Despite the failure, the container might be 
> launched and the following attempt to remove the container without call 
> `WAIT_NESTED_CONTAINER` leads to errors like:
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal 
> Server Error' (Nested container has not terminated yet) while removing the 
> nested container 
> '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
>  used for the COMMAND check for task 
> 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
> The checker library should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9131) Health checks launching nested containers while a container is being destroyed lead to unkillable tasks

2018-08-03 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-9131:
---

 Summary: Health checks launching nested containers while a 
container is being destroyed lead to unkillable tasks
 Key: MESOS-9131
 URL: https://issues.apache.org/jira/browse/MESOS-9131
 Project: Mesos
  Issue Type: Bug
  Components: agent
Reporter: Jan Schlicht


A container might get stuck in {{DESTROYING}} state if there's a command health 
check that starts new nested containers while its parent container is getting 
destroyed.

Here are some logs with unrelated lines removed. The 
`REMOVE_NESTED_CONTAINER`/`LAUNCH_NESTED_CONTAINER_SESSION` cycle keeps looping 
afterwards.
{noformat}
2018-04-16 12:37:54: I0416 12:37:54.235877  3863 containerizer.cpp:2807] 
Container 
db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 has 
exited
2018-04-16 12:37:54: I0416 12:37:54.235914  3863 containerizer.cpp:2354] 
Destroying container 
db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 in 
RUNNING state
2018-04-16 12:37:54: I0416 12:37:54.235932  3863 containerizer.cpp:2968] 
Transitioning the state of container 
db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 from 
RUNNING to DESTROYING
2018-04-16 12:37:54: I0416 12:37:54.236100  3852 linux_launcher.cpp:514] Asked 
to destroy container 
db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.e6e01854-40a0-4da3-b458-2b4cf52bbc11
2018-04-16 12:37:54: I0416 12:37:54.237671  3852 linux_launcher.cpp:560] Using 
freezer to destroy cgroup 
mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
2018-04-16 12:37:54: I0416 12:37:54.240327  3852 cgroups.cpp:3060] Freezing 
cgroup 
/sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
2018-04-16 12:37:54: I0416 12:37:54.244179  3852 cgroups.cpp:1415] Successfully 
froze cgroup 
/sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
 after 3.814144ms
2018-04-16 12:37:54: I0416 12:37:54.250550  3853 cgroups.cpp:3078] Thawing 
cgroup 
/sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
2018-04-16 12:37:54: I0416 12:37:54.256599  3853 cgroups.cpp:1444] Successfully 
thawed cgroup 
/sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
 after 5.977856ms
...
2018-04-16 12:37:54: I0416 12:37:54.371117  3837 http.cpp:3502] Processing 
LAUNCH_NESTED_CONTAINER_SESSION call for container 
'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd'
2018-04-16 12:37:54: W0416 12:37:54.371692  3842 http.cpp:2758] Failed to 
launch container 
db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd:
 Parent container 
db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 is in 
'DESTROYING' state
2018-04-16 12:37:54: W0416 12:37:54.371826  3840 containerizer.cpp:2337] 
Attempted to destroy unknown container 
db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd
...
2018-04-16 12:37:55: I0416 12:37:55.504456  3856 http.cpp:3078] Processing 
REMOVE_NESTED_CONTAINER call for container 
'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-f3a1238c-7f0f-4db3-bda4-c0ea951d46b6'
...
2018-04-16 12:37:55: I0416 12:37:55.556367  3857 http.cpp:3502] Processing 
LAUNCH_NESTED_CONTAINER_SESSION call for container 
'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211'
...
2018-04-16 12:37:55: W0416 12:37:55.582137  3850 http.cpp:2758] Failed to 
launch container 
db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211:
 Parent container 
db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 is in 
'DESTROYING' state
...
2018-04-16 12:37:55: W0416 12:37:55.583330  3844 containerizer.cpp:2337] 
Attempted to destroy unknown container 
db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211
...
{noformat}

This stops when the framework reconciles and instructs Mesos to kill the task, 
which also results in a
{noformat}
2018-04-16 13:06:04: I0416 13:06:04.161623  3843 http.cpp:2966] Processing 
KILL_NESTED_CONTAINER call for container 
'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133'
{noformat}
Nothing else related to this 

[jira] [Assigned] (MESOS-9094) On macOS libprocess_tests fail to link when compiling with gRPC

2018-07-24 Thread Jan Schlicht (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht reassigned MESOS-9094:
---

Assignee: Jan Schlicht

> On macOS libprocess_tests fail to link when compiling with gRPC
> ---
>
> Key: MESOS-9094
> URL: https://issues.apache.org/jira/browse/MESOS-9094
> Project: Mesos
>  Issue Type: Bug
> Environment: macOS 10.13.6 with clang 6.0.1.
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>Priority: Major
> Fix For: 1.7.0
>
>
> Seems like this was introduced with commit 
> {{a211b4cadf289168464fc50987255d883c226e89}}. Linking {{libprocess-tests}} on 
> macOS with enabled gRPC fails with
> {noformat}
> Undefined symbols for architecture x86_64:
>   
> "grpc::TimePoint std::__1::chrono::duration > > 
> >::you_need_a_specialization_of_TimePoint()", referenced from:
>   process::Future > 
> process::grpc::client::Runtime::call,
>  std::__1::default_delete > > 
> (tests::PingPong::Stub::*)(grpc::ClientContext*, tests::Ping const&, 
> grpc::CompletionQueue*), tests::Ping, tests::Pong, 
> 0>(process::grpc::client::Connection const&, 
> std::__1::unique_ptr, 
> std::__1::default_delete > > 
> (tests::PingPong::Stub::*&&)(grpc::ClientContext*, tests::Ping const&, 
> grpc::CompletionQueue*), tests::Ping&&, process::grpc::client::CallOptions 
> const&)::'lambda'(tests::Ping const&, bool, 
> grpc::CompletionQueue*)::operator()(tests::Ping const&, bool, 
> grpc::CompletionQueue*) const in libprocess_tests-grpc_tests.o
> ld: symbol(s) not found for architecture x86_64
> clang-6.0: error: linker command failed with exit code 1 (use -v to see 
> invocation)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9094) On macOS libprocess_tests fail to link when compiling with gRPC

2018-07-19 Thread Jan Schlicht (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16548922#comment-16548922
 ] 

Jan Schlicht commented on MESOS-9094:
-

cc [~chhsia0]. Found https://grpc.io/grpc/cpp/classgrpc_1_1_time_point.html 
which seems to be related.
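
For context, a hedged sketch of how that trait is normally satisfied: 
{{grpc::ClientContext::set_deadline}} accepts any type with a 
{{grpc::TimePoint}} specialization, and gRPC ships one for 
{{std::chrono::system_clock::time_point}}, so converting a raw duration into 
an absolute time point avoids instantiating the unspecialized template named 
in the linker error. Whether that matches the failing Mesos call site is an 
assumption.
{code:cpp}
#include <chrono>

#include <grpcpp/client_context.h>

// Sketch under the assumption that the failing code passes a raw
// std::chrono::duration where a deadline is expected. Only absolute time
// points (e.g. system_clock::time_point) come with TimePoint
// specializations, so convert the timeout first.
void setDeadline(grpc::ClientContext* context, std::chrono::milliseconds timeout)
{
  const std::chrono::system_clock::time_point deadline =
    std::chrono::system_clock::now() + timeout;

  context->set_deadline(deadline);
}
{code}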

> On macOS libprocess_tests fail to link when compiling with gRPC
> ---
>
> Key: MESOS-9094
> URL: https://issues.apache.org/jira/browse/MESOS-9094
> Project: Mesos
>  Issue Type: Bug
> Environment: macOS 10.13.6 with clang 6.0.1.
>Reporter: Jan Schlicht
>Priority: Major
> Fix For: 1.7.0
>
>
> Seems like this was introduced with commit 
> {{a211b4cadf289168464fc50987255d883c226e89}}. Linking {{libprocess-tests}} on 
> macOS with enabled gRPC fails with
> {noformat}
> Undefined symbols for architecture x86_64:
>   
> "grpc::TimePoint std::__1::chrono::duration > > 
> >::you_need_a_specialization_of_TimePoint()", referenced from:
>   process::Future > 
> process::grpc::client::Runtime::call,
>  std::__1::default_delete > > 
> (tests::PingPong::Stub::*)(grpc::ClientContext*, tests::Ping const&, 
> grpc::CompletionQueue*), tests::Ping, tests::Pong, 
> 0>(process::grpc::client::Connection const&, 
> std::__1::unique_ptr, 
> std::__1::default_delete > > 
> (tests::PingPong::Stub::*&&)(grpc::ClientContext*, tests::Ping const&, 
> grpc::CompletionQueue*), tests::Ping&&, process::grpc::client::CallOptions 
> const&)::'lambda'(tests::Ping const&, bool, 
> grpc::CompletionQueue*)::operator()(tests::Ping const&, bool, 
> grpc::CompletionQueue*) const in libprocess_tests-grpc_tests.o
> ld: symbol(s) not found for architecture x86_64
> clang-6.0: error: linker command failed with exit code 1 (use -v to see 
> invocation)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9094) On macOS libprocess_tests fail to link when compiling with gRPC

2018-07-19 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-9094:
---

 Summary: On macOS libprocess_tests fail to link when compiling 
with gRPC
 Key: MESOS-9094
 URL: https://issues.apache.org/jira/browse/MESOS-9094
 Project: Mesos
  Issue Type: Bug
 Environment: macOS 10.13.6 with clang 6.0.1.
Reporter: Jan Schlicht
 Fix For: 1.7.0


Seems like this was introduced with commit 
{{a211b4cadf289168464fc50987255d883c226e89}}. Linking {{libprocess-tests}} on 
macOS with enabled gRPC fails with
{noformat}
Undefined symbols for architecture x86_64:
  "grpc::TimePoint > > 
>::you_need_a_specialization_of_TimePoint()", referenced from:
  process::Future > 
process::grpc::client::Runtime::call,
 std::__1::default_delete > > 
(tests::PingPong::Stub::*)(grpc::ClientContext*, tests::Ping const&, 
grpc::CompletionQueue*), tests::Ping, tests::Pong, 
0>(process::grpc::client::Connection const&, 
std::__1::unique_ptr, 
std::__1::default_delete > > 
(tests::PingPong::Stub::*&&)(grpc::ClientContext*, tests::Ping const&, 
grpc::CompletionQueue*), tests::Ping&&, process::grpc::client::CallOptions 
const&)::'lambda'(tests::Ping const&, bool, 
grpc::CompletionQueue*)::operator()(tests::Ping const&, bool, 
grpc::CompletionQueue*) const in libprocess_tests-grpc_tests.o
ld: symbol(s) not found for architecture x86_64
clang-6.0: error: linker command failed with exit code 1 (use -v to see 
invocation)
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7441) RegisterSlaveValidationTest.DropInvalidRegistration is flaky

2018-07-03 Thread Jan Schlicht (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531286#comment-16531286
 ] 

Jan Schlicht commented on MESOS-7441:
-

Reopened, as there was a recent test run (on {{master}}, SHA {{b50f6c8a}}) 
failing on CentOS 6 with
{noformat}
[ RUN  ] RegisterSlaveValidationTest.DropInvalidRegistration
I0703 11:44:46.746553 16172 cluster.cpp:173] Creating default 'local' authorizer
I0703 11:44:46.747535 16196 master.cpp:463] Master 
cce3860c-7d4f-4996-b865-fc8ce8302705 (ip-172-16-10-44.ec2.internal) started on 
172.16.10.44:33909
I0703 11:44:46.747611 16196 master.cpp:466] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/dwPsJP/credentials" 
--filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/dwPsJP/master" --zk_session_timeout="10secs"
I0703 11:44:46.747733 16196 master.cpp:515] Master only allowing authenticated 
frameworks to register
I0703 11:44:46.747748 16196 master.cpp:521] Master only allowing authenticated 
agents to register
I0703 11:44:46.747754 16196 master.cpp:527] Master only allowing authenticated 
HTTP frameworks to register
I0703 11:44:46.747761 16196 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/dwPsJP/credentials'
I0703 11:44:46.747872 16196 master.cpp:571] Using default 'crammd5' 
authenticator
I0703 11:44:46.747907 16196 http.cpp:959] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0703 11:44:46.747944 16196 http.cpp:959] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0703 11:44:46.747967 16196 http.cpp:959] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0703 11:44:46.747997 16196 master.cpp:652] Authorization enabled
I0703 11:44:46.748157 16194 hierarchical.cpp:177] Initialized hierarchical 
allocator process
I0703 11:44:46.748183 16194 whitelist_watcher.cpp:77] No whitelist given
I0703 11:44:46.748715 16196 master.cpp:2162] Elected as the leading master!
I0703 11:44:46.748736 16196 master.cpp:1717] Recovering from registrar
I0703 11:44:46.748950 16196 registrar.cpp:339] Recovering registrar
I0703 11:44:46.749035 16196 registrar.cpp:383] Successfully fetched the 
registry (0B) in 68864ns
I0703 11:44:46.749059 16196 registrar.cpp:487] Applied 1 operations in 5058ns; 
attempting to update the registry
I0703 11:44:46.749349 16196 registrar.cpp:544] Successfully updated the 
registry in 275968ns
I0703 11:44:46.749385 16196 registrar.cpp:416] Successfully recovered registrar
I0703 11:44:46.749465 16196 master.cpp:1831] Recovered 0 agents from the 
registry (172B); allowing 10mins for agents to reregister
I0703 11:44:46.749589 16196 hierarchical.cpp:215] Skipping recovery of 
hierarchical allocator: nothing to recover
W0703 11:44:46.751214 16172 process.cpp:2824] Attempted to spawn already 
running process files@172.16.10.44:33909
I0703 11:44:46.751505 16172 containerizer.cpp:300] Using isolation { 
environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
I0703 11:44:46.753739 16172 linux_launcher.cpp:146] Using /cgroup/freezer as 
the freezer hierarchy for the Linux launcher
I0703 11:44:46.754091 16172 provisioner.cpp:298] Using default backend 'copy'
I0703 11:44:46.754447 16172 cluster.cpp:479] Creating default 'local' authorizer
I0703 11:44:46.754907 16195 slave.cpp:268] Mesos agent started on 
(361)@172.16.10.44:33909
I0703 11:44:46.754920 16195 slave.cpp:269] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://; 
--appc_store_dir="/tmp/RegisterSlaveValidationTest_DropInvalidRegistration_W7jYUL/store/appc"
 --authenticate_http_executors="true" --authenticate_http_readonly="true" 

[jira] [Created] (MESOS-9045) LogZooKeeperTest.WriteRead can segfault

2018-07-02 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-9045:
---

 Summary: LogZooKeeperTest.WriteRead can segfault
 Key: MESOS-9045
 URL: https://issues.apache.org/jira/browse/MESOS-9045
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.5.1
 Environment: macOS
Reporter: Jan Schlicht


The following segfault occurred when testing the {{1.5.x}} branch (SHA 
{{64341865d}}) on macOS:
{noformat}
[ RUN  ] LogZooKeeperTest.WriteRead
I0702 00:49:46.259831 2560127808 jvm.cpp:590] Looking up method 
(Ljava/lang/String;)V
I0702 00:49:46.260002 2560127808 jvm.cpp:590] Looking up method deleteOnExit()V
I0702 00:49:46.260550 2560127808 jvm.cpp:590] Looking up method 
(Ljava/io/File;Ljava/io/File;)V
log4j:WARN No appenders could be found for logger 
(org.apache.zookeeper.server.persistence.FileTxnSnapLog).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
I0702 00:49:46.305560 2560127808 jvm.cpp:590] Looking up method ()V
I0702 00:49:46.306149 2560127808 jvm.cpp:590] Looking up method 
(Lorg/apache/zookeeper/server/persistence/FileTxnSnapLog;Lorg/apache/zookeeper/server/ZooKeeperServer$DataTreeBuilder;)V
I0702 00:49:46.07 2560127808 jvm.cpp:590] Looking up method ()V
I0702 00:49:46.343977 2560127808 jvm.cpp:590] Looking up method (I)V
I0702 00:49:46.344200 2560127808 jvm.cpp:590] Looking up method 
configure(Ljava/net/InetSocketAddress;I)V
I0702 00:49:46.357642 2560127808 jvm.cpp:590] Looking up method 
startup(Lorg/apache/zookeeper/server/ZooKeeperServer;)V
I0702 00:49:46.437831 2560127808 jvm.cpp:590] Looking up method getClientPort()I
I0702 00:49:46.437893 2560127808 zookeeper_test_server.cpp:156] Started 
ZooKeeperTestServer on port 54057
I0702 00:49:46.438153 2560127808 log_tests.cpp:2468] Using temporary directory 
'/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/LogZooKeeperTest_WriteRead_AKZArL'
I0702 00:49:46.440680 2560127808 leveldb.cpp:174] Opened db in 2.415822ms
I0702 00:49:46.441301 2560127808 leveldb.cpp:181] Compacted db in 584251ns
I0702 00:49:46.441349 2560127808 leveldb.cpp:196] Created db iterator in 20482ns
I0702 00:49:46.441380 2560127808 leveldb.cpp:202] Seeked to beginning of db in 
14577ns
I0702 00:49:46.441407 2560127808 leveldb.cpp:277] Iterated through 0 keys in 
the db in 16622ns
I0702 00:49:46.441447 2560127808 replica.cpp:795] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0702 00:49:46.441737 207974400 leveldb.cpp:310] Persisting metadata (8 bytes) 
to leveldb took 157037ns
I0702 00:49:46.441764 207974400 replica.cpp:322] Persisted replica status to 
VOTING
I0702 00:49:46.443361 2560127808 leveldb.cpp:174] Opened db in 1.305425ms
I0702 00:49:46.443821 2560127808 leveldb.cpp:181] Compacted db in 448477ns
I0702 00:49:46.443871 2560127808 leveldb.cpp:196] Created db iterator in 12681ns
I0702 00:49:46.443889 2560127808 leveldb.cpp:202] Seeked to beginning of db in 
13291ns
I0702 00:49:46.443914 2560127808 leveldb.cpp:277] Iterated through 0 keys in 
the db in 14460ns
I0702 00:49:46.443944 2560127808 replica.cpp:795] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0702 00:49:46.444277 206901248 leveldb.cpp:310] Persisting metadata (8 bytes) 
to leveldb took 234740ns
I0702 00:49:46.444317 206901248 replica.cpp:322] Persisted replica status to 
VOTING
I0702 00:49:46.445854 2560127808 leveldb.cpp:174] Opened db in 1.253613ms
I0702 00:49:46.446967 2560127808 leveldb.cpp:181] Compacted db in 1.096521ms
I0702 00:49:46.447022 2560127808 leveldb.cpp:196] Created db iterator in 14312ns
I0702 00:49:46.447048 2560127808 leveldb.cpp:202] Seeked to beginning of db in 
16620ns
I0702 00:49:46.447077 2560127808 leveldb.cpp:277] Iterated through 1 keys in 
the db in 21267ns
I0702 00:49:46.447113 2560127808 replica.cpp:795] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
2018-07-02 00:49:46,447:85946(0x7c6da000):ZOO_INFO@log_env@753: Client 
environment:zookeeper.version=zookeeper C client 3.4.8
2018-07-02 00:49:46,447:85946(0x7c6da000):ZOO_INFO@log_env@757: Client 
environment:host.name=Jenkinss-Mac-mini.local
2018-07-02 00:49:46,447:85946(0x7c657000):ZOO_INFO@log_env@753: Client 
environment:zookeeper.version=zookeeper C client 3.4.8
2018-07-02 00:49:46,447:85946(0x7c657000):ZOO_INFO@log_env@757: Client 
environment:host.name=Jenkinss-Mac-mini.local
2018-07-02 00:49:46,447:85946(0x7c6da000):ZOO_INFO@log_env@764: Client 
environment:os.name=Darwin
2018-07-02 00:49:46,447:85946(0x7c6da000):ZOO_INFO@log_env@765: Client 
environment:os.arch=17.4.0
2018-07-02 00:49:46,447:85946(0x7c657000):ZOO_INFO@log_env@764: Client 
environment:os.name=Darwin
I0702 00:49:46.447453 206901248 log.cpp:108] Attempting to join replica to 
ZooKeeper group
2018-07-02 00:49:46,447:85946(0x7c6da000):ZOO_INFO@log_env@766: Client 

[jira] [Created] (MESOS-9044) DefaultExecutorTest.ROOT_ContainerStatusForTask can segfault

2018-07-02 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-9044:
---

 Summary: DefaultExecutorTest.ROOT_ContainerStatusForTask can 
segfault
 Key: MESOS-9044
 URL: https://issues.apache.org/jira/browse/MESOS-9044
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 1.5.1
 Environment: Ubuntu 16.04
Reporter: Jan Schlicht


The following segfault occurred when testing the {{1.5.x}} branch (SHA 
{{64341865d}}) on Ubuntu 16.04:
{noformat}
[ RUN  ] 
MesosContainerizer/DefaultExecutorTest.ROOT_ContainerStatusForTask/0
I0702 08:32:25.241318 17172 cluster.cpp:172] Creating default 'local' authorizer
I0702 08:32:25.242328  6510 master.cpp:457] Master 
be25b90e-f63d-4935-aaf3-cacfc7faacbf (ip-172-16-10-86.ec2.internal) started on 
172.16.10.86:32891
I0702 08:32:25.242413  6510 master.cpp:459] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/I9TI6h/credentials" 
--filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/I9TI6h/master" --zk_session_timeout="10secs"
I0702 08:32:25.242554  6510 master.cpp:508] Master only allowing authenticated 
frameworks to register
I0702 08:32:25.242564  6510 master.cpp:514] Master only allowing authenticated 
agents to register
I0702 08:32:25.242570  6510 master.cpp:520] Master only allowing authenticated 
HTTP frameworks to register
I0702 08:32:25.242575  6510 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/I9TI6h/credentials'
I0702 08:32:25.242677  6510 master.cpp:564] Using default 'crammd5' 
authenticator
I0702 08:32:25.242728  6510 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0702 08:32:25.242780  6510 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0702 08:32:25.242830  6510 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0702 08:32:25.242864  6510 master.cpp:643] Authorization enabled
I0702 08:32:25.243048  6507 hierarchical.cpp:175] Initialized hierarchical 
allocator process
I0702 08:32:25.243223  6507 whitelist_watcher.cpp:77] No whitelist given
I0702 08:32:25.243743  6510 master.cpp:2210] Elected as the leading master!
I0702 08:32:25.243768  6510 master.cpp:1690] Recovering from registrar
I0702 08:32:25.243832  6511 registrar.cpp:347] Recovering registrar
I0702 08:32:25.244055  6511 registrar.cpp:391] Successfully fetched the 
registry (0B) in 124928ns
I0702 08:32:25.244096  6511 registrar.cpp:495] Applied 1 operations in 8690ns; 
attempting to update the registry
I0702 08:32:25.244261  6511 registrar.cpp:552] Successfully updated the 
registry in 146944ns
I0702 08:32:25.244302  6511 registrar.cpp:424] Successfully recovered registrar
I0702 08:32:25.244416  6511 master.cpp:1803] Recovered 0 agents from the 
registry (172B); allowing 10mins for agents to re-register
I0702 08:32:25.244556  6505 hierarchical.cpp:213] Skipping recovery of 
hierarchical allocator: nothing to recover
W0702 08:32:25.246150 17172 process.cpp:2759] Attempted to spawn already 
running process files@172.16.10.86:32891
I0702 08:32:25.246560 17172 containerizer.cpp:304] Using isolation { 
environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
I0702 08:32:25.250222 17172 linux_launcher.cpp:146] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I0702 08:32:25.250689 17172 provisioner.cpp:299] Using default backend 'overlay'
I0702 08:32:25.251200 17172 cluster.cpp:460] Creating default 'local' authorizer
I0702 08:32:25.251788  6509 slave.cpp:262] Mesos agent started on 
(996)@172.16.10.86:32891
I0702 08:32:25.251878  6509 slave.cpp:263] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://; 

[jira] [Commented] (MESOS-8985) Posting to the operator api with 'accept recordio' header can crash the agent

2018-06-11 Thread Jan Schlicht (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16507953#comment-16507953
 ] 

Jan Schlicht commented on MESOS-8985:
-

This is caused by {{Content-Type}} being (in Mesos terms) a non-streaming type, 
while {{Accept}} indicates a streaming type. We don't cover this case in the 
current code, make some wrong assumptions, and finally erroneously try to 
serialize to RecordIO, which isn't supported.
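
A hedged sketch of the kind of up-front validation that would avoid reaching 
the serializer. The helper name and shape are illustrative, not the actual 
Mesos fix:
{code:cpp}
#include <string>

// Illustrative only: a non-streaming response cannot be serialized to
// RecordIO, so a request whose 'Accept' header only allows
// 'application/recordio' should be rejected (e.g. 406 Not Acceptable)
// instead of falling through to the serializer, which aborts.
bool canServeRequest(const std::string& acceptHeader, bool responseIsStreaming)
{
  const bool acceptsRecordIO =
    acceptHeader.find("application/recordio") != std::string::npos;
  const bool acceptsJson =
    acceptHeader.find("application/json") != std::string::npos;
  const bool acceptsProtobuf =
    acceptHeader.find("application/x-protobuf") != std::string::npos;

  if (responseIsStreaming) {
    return acceptsRecordIO;
  }

  // Non-streaming responses need a non-streaming content type.
  return acceptsJson || acceptsProtobuf;
}
{code}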

> Posting to the operator api with 'accept recordio' header can crash the agent
> -
>
> Key: MESOS-8985
> URL: https://issues.apache.org/jira/browse/MESOS-8985
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.1, 1.5.1
>Reporter: Philip Norman
>Assignee: Gilbert Song
>Priority: Major
> Attachments: mesos-slave-crash.log
>
>
> It's possible to crash the mesos agent by posting a reasonable request to the 
> operator API.
> h3. Background:
> Sending a request to the v1 api endpoint with an unsupported 'accept' header:
> {code:java}
> curl -X POST http://10.0.3.27:5051/api/v1 \
>   -H 'accept: application/atom+xml' \
>   -H 'content-type: application/json' \
>   -d '{"type":"GET_CONTAINERS","get_containers":{"show_nested": 
> true,"show_standalone": true}}'{code}
> Results in the following friendly error message:
> {code:java}
> Expecting 'Accept' to allow application/json or application/x-protobuf or 
> application/recordio{code}
> h3. Reproducible crash:
> However, sending the same request with 'application/recordio' 'accept' header:
> {code:java}
> curl -X POST \
> http://10.0.3.27:5051/api/v1 \
>   -H 'accept: application/recordio' \
>   -H 'content-type: application/json' \
>   -d '{"type":"GET_CONTAINERS","get_containers":{"show_nested": 
> true,"show_standalone": true}}'{code}
> causes the agent to crash (no response is received).
> Crash log is shown below, full log from the agent is attached here:
> {code:java}
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: 
> I0607 22:30:32.397320 3743 logfmt.cpp:178] type=audit timestamp=2018-06-07 
> 22:30:32.397243904+00:00 reason="Error in token 'Missing 'Authorization' 
> header from HTTP request'. Allowing anonymous connection" 
> object="/slave(1)/api/v1" agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 
> 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 
> Safari/537.36" authorizer="mesos-agent" action="POST" result=allow 
> srcip=10.0.6.99 dstport=5051 srcport=42084 dstip=10.0.3.27
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: 
> W0607 22:30:32.397434 3743 authenticator.cpp:289] Error in token on request 
> from '10.0.6.99:42084': Missing 'Authorization' header from HTTP request
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: 
> W0607 22:30:32.397466 3743 authenticator.cpp:291] Falling back to anonymous 
> connection using user 'dcos_anonymous'
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: 
> I0607 22:30:32.397629 3748 http.cpp:1099] HTTP POST for /slave(1)/api/v1 from 
> 10.0.6.99:42084 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 
> 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 
> Safari/537.36'
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: 
> I0607 22:30:32.397784 3748 http.cpp:2030] Processing GET_CONTAINERS call
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: 
> F0607 22:30:32.398736 3747 http.cpp:121] Serializing a RecordIO stream is not 
> supported
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: 
> *** Check failure stack trace: ***
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: @ 
> 0x7f619478636d google::LogMessage::Fail()
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: @ 
> 0x7f619478819d google::LogMessage::SendToLog()
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: @ 
> 0x7f6194785f5c google::LogMessage::Flush()
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: @ 
> 0x7f6194788a99 google::LogMessageFatal::~LogMessageFatal()
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: @ 
> 0x7f61935e2b9d mesos::internal::serialize()
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: @ 
> 0x7f6193a4c0ef 
> _ZNO6lambda12CallableOnceIFN7process6FutureINS1_4http8ResponseEEERKN4JSON5ArrayEEE10CallableFnIZNK5mesos8internal5slave4Http13getContainersERKNSD_5agent4CallENSD_11ContentTypeERK6OptionINS3_14authentication9PrincipalEEEUlRKNS2_IS7_EEE0_EclES9_
> Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: @ 
> 0x7f6193a81d61 

[jira] [Assigned] (MESOS-7329) Authorize offer operations for converting disk resources

2018-06-06 Thread Jan Schlicht (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-7329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht reassigned MESOS-7329:
---

Assignee: Jan Schlicht

> Authorize offer operations for converting disk resources
> 
>
> Key: MESOS-7329
> URL: https://issues.apache.org/jira/browse/MESOS-7329
> Project: Mesos
>  Issue Type: Task
>  Components: master, security
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>Priority: Major
>  Labels: csi-post-mvp, mesosphere, security, storage
>
> All offer operations are authorized, hence authorization logic has to be 
> added to new offer operations as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8896) 'ZooKeeperMasterContenderDetectorTest.NonRetryableFrrors' is flaky

2018-05-09 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8896:
---

 Summary: 'ZooKeeperMasterContenderDetectorTest.NonRetryableFrrors' 
is flaky
 Key: MESOS-8896
 URL: https://issues.apache.org/jira/browse/MESOS-8896
 Project: Mesos
  Issue Type: Bug
  Components: flaky
Reporter: Jan Schlicht


This was a test failure on macOS with SSL enabled. Not sure yet if other 
systems might be affected as well:
{noformat}
[ RUN  ] ZooKeeperMasterContenderDetectorTest.NonRetryableFrrors
I0509 01:36:35.181434 2992141120 zookeeper_test_server.cpp:156] Started 
ZooKeeperTestServer on port 58450
2018-05-09 01:36:35,181:44641(0x79f15000):ZOO_INFO@log_env@753: Client 
environment:zookeeper.version=zookeeper C client 3.4.8
2018-05-09 01:36:35,181:44641(0x79f15000):ZOO_INFO@log_env@757: Client 
environment:host.name=Jenkinss-Mac-mini.local
2018-05-09 01:36:35,181:44641(0x79f15000):ZOO_INFO@log_env@764: Client 
environment:os.name=Darwin
2018-05-09 01:36:35,181:44641(0x79f15000):ZOO_INFO@log_env@765: Client 
environment:os.arch=17.4.0
2018-05-09 01:36:35,181:44641(0x79f15000):ZOO_INFO@log_env@766: Client 
environment:os.version=Darwin Kernel Version 17.4.0: Sun Dec 17 09:19:54 PST 
2017; root:xnu-4570.41.2~1/RELEASE_X86_64
2018-05-09 01:36:35,181:44641(0x79f15000):ZOO_INFO@log_env@774: Client 
environment:user.name=jenkins
2018-05-09 01:36:35,181:44641(0x79f15000):ZOO_INFO@log_env@782: Client 
environment:user.home=/Users/jenkins
2018-05-09 01:36:35,181:44641(0x79f15000):ZOO_INFO@log_env@794: Client 
environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build
2018-05-09 01:36:35,181:44641(0x79f15000):ZOO_INFO@zookeeper_init@827: 
Initiating client connection, host=127.0.0.1:58450 sessionTimeout=1 
watcher=0x1148b6680 sessionId=0 sessionPasswd= context=0x7fe697de7590 
flags=0
2018-05-09 01:36:35,182:44641(0x7aa42000):ZOO_INFO@check_events@1764: 
initiated connection to server [127.0.0.1:58450]
2018-05-09 01:36:35,185:44641(0x7aa42000):ZOO_INFO@check_events@1811: 
session establishment complete on server [127.0.0.1:58450], 
sessionId=0x163440b82ec, negotiated timeout=1
I0509 01:36:35.186167 167882752 group.cpp:341] Group process 
(zookeeper-group(14)@10.0.49.4:57595) connected to ZooKeeper
I0509 01:36:35.186213 167882752 group.cpp:831] Syncing group operations: queue 
size (joins, cancels, datas) = (1, 0, 0)
I0509 01:36:35.186226 167882752 group.cpp:395] Authenticating with ZooKeeper 
using digest
2018-05-09 
01:36:38,534:44641(0x7aa42000):ZOO_INFO@auth_completion_func@1327: 
Authentication scheme digest succeeded
I0509 01:36:38.534493 167882752 group.cpp:419] Trying to create path '/mesos' 
in ZooKeeper
2018-05-09 01:36:38,540:44641(0x7a121000):ZOO_INFO@log_env@753: Client 
environment:zookeeper.version=zookeeper C client 3.4.8
2018-05-09 01:36:38,540:44641(0x7a121000):ZOO_INFO@log_env@757: Client 
environment:host.name=Jenkinss-Mac-mini.local
2018-05-09 01:36:38,540:44641(0x7a121000):ZOO_INFO@log_env@764: Client 
environment:os.name=Darwin
2018-05-09 01:36:38,540:44641(0x7a121000):ZOO_INFO@log_env@765: Client 
environment:os.arch=17.4.0
2018-05-09 01:36:38,540:44641(0x7a121000):ZOO_INFO@log_env@766: Client 
environment:os.version=Darwin Kernel Version 17.4.0: Sun Dec 17 09:19:54 PST 
2017; root:xnu-4570.41.2~1/RELEASE_X86_64
2018-05-09 01:36:38,540:44641(0x7a121000):ZOO_INFO@log_env@774: Client 
environment:user.name=jenkins
2018-05-09 01:36:38,540:44641(0x7a121000):ZOO_INFO@log_env@782: Client 
environment:user.home=/Users/jenkins
2018-05-09 01:36:38,540:44641(0x7a121000):ZOO_INFO@log_env@794: Client 
environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build
2018-05-09 01:36:38,540:44641(0x7a121000):ZOO_INFO@zookeeper_init@827: 
Initiating client connection, host=127.0.0.1:58450 sessionTimeout=1 
watcher=0x1148b6680 sessionId=0 sessionPasswd= context=0x7fe6999c1fe0 
flags=0
I0509 01:36:38.540652 166273024 contender.cpp:152] Joining the ZK group
2018-05-09 01:36:38,540:44641(0x7b463000):ZOO_INFO@check_events@1764: 
initiated connection to server [127.0.0.1:58450]
2018-05-09 01:36:38,542:44641(0x7b463000):ZOO_INFO@check_events@1811: 
session establishment complete on server [127.0.0.1:58450], 
sessionId=0x163440b82ec0001, negotiated timeout=1
I0509 01:36:38.542425 168955904 group.cpp:341] Group process 
(zookeeper-group(15)@10.0.49.4:57595) connected to ZooKeeper
I0509 01:36:38.542466 168955904 group.cpp:831] Syncing group operations: queue 
size (joins, cancels, datas) = (1, 0, 0)
I0509 01:36:38.542480 168955904 group.cpp:395] Authenticating with ZooKeeper 
using digest
2018-05-09 01:36:50,559:44641(0x7aa42000):ZOO_WARN@zookeeper_interest@1597: 
Exceeded deadline by 8687ms
2018-05-09 

[jira] [Created] (MESOS-8868) Some 'FsTest' test cases fail on macOS

2018-05-02 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8868:
---

 Summary: Some 'FsTest' test cases fail on macOS
 Key: MESOS-8868
 URL: https://issues.apache.org/jira/browse/MESOS-8868
 Project: Mesos
  Issue Type: Bug
 Environment: macOS 10.13.4, clang 6.0.0.
Reporter: Jan Schlicht


These tests fail in {{674db615971d2288ffdd1b64f2be93367e03a63d}}:
{noformat}
[ RUN  ] FsTest.CreateDirectoryAtMaxPath
../../../3rdparty/stout/tests/os/filesystem_tests.cpp:243: Failure
Value of: (os::realpath(testfile)).get()
  Actual: 
"/private/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/FlHiuR//file.txt"
Expected: testfile
Which is: 
"/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/FlHiuR//file.txt"
[  FAILED  ] FsTest.CreateDirectoryAtMaxPath (1 ms)
[ RUN  ] FsTest.CreateDirectoryLongerThanMaxPath
../../../3rdparty/stout/tests/os/filesystem_tests.cpp:267: Failure
Value of: (os::realpath(testfile)).get()
  Actual: 
"/private/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/tQjz6A/87efabe7-c026-4d44-9174-7ffaffe92aea/fdf3029c-3ccb-472a-91a9-79c56a114f0a/33b71897-2b23-4546-83f1-f77132e48b86/7548fb65-fa84-4260-80ff-a4d9133e5fe3/221b923d-ddc3-473e-a19a-a18863985401/03e8e58d-80a1-40db-8091-3676c5ecba05/file.txt"
Expected: testfile
Which is: 
"/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/tQjz6A/87efabe7-c026-4d44-9174-7ffaffe92aea/fdf3029c-3ccb-472a-91a9-79c56a114f0a/33b71897-2b23-4546-83f1-f77132e48b86/7548fb65-fa84-4260-80ff-a4d9133e5fe3/221b923d-ddc3-473e-a19a-a18863985401/03e8e58d-80a1-40db-8091-3676c5ecba05/file.txt"
[  FAILED  ] FsTest.CreateDirectoryLongerThanMaxPath (1 ms)
[ RUN  ] FsTest.RealpathValidationOnOpenFile
../../../3rdparty/stout/tests/os/filesystem_tests.cpp:286: Failure
Value of: (os::realpath(file)).get()
  Actual: 
"/private/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/k9wmip/b44085df-3da8-4799-9893-80ad4e007a80"
Expected: file
Which is: 
"/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/k9wmip/b44085df-3da8-4799-9893-80ad4e007a80"
[  FAILED  ] FsTest.RealpathValidationOnOpenFile (0 ms)
{noformat}

Seems like a regression introduced in stout changes that started with 
{{8b7798f31ea37077e5091d279fcf352a01577366}}.
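
For reference, a small self-contained illustration of why the comparison fails 
on macOS (the path below is an assumed example): {{/var}} is a symlink to 
{{/private/var}}, so {{realpath()}} of anything under the default temporary 
directory carries the {{/private}} prefix and no longer compares equal to the 
original string.
{code:cpp}
#include <limits.h>
#include <stdlib.h>

#include <iostream>
#include <string>

// Illustration only: resolve a path that sits behind the /var -> /private/var
// symlink and print both forms. A string-equality check between them fails,
// which is what the FsTest cases above run into.
int main()
{
  const std::string path = "/var/tmp";  // assumed example path

  char resolved[PATH_MAX];
  if (::realpath(path.c_str(), resolved) != nullptr) {
    // On macOS this typically prints "/var/tmp -> /private/var/tmp".
    std::cout << path << " -> " << resolved << std::endl;
  }

  return 0;
}
{code}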
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8867) CMake: Bundled libevent v2.1.5-beta doesn't compile with OpenSSL 1.1.0

2018-05-02 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8867:
---

 Summary: CMake: Bundled libevent v2.1.5-beta doesn't compile with 
OpenSSL 1.1.0
 Key: MESOS-8867
 URL: https://issues.apache.org/jira/browse/MESOS-8867
 Project: Mesos
  Issue Type: Bug
  Components: cmake
 Environment: Fedora 28 with OpenSSL 1.1.0h, {{cmake -G Ninja -D 
ENABLE_LIBEVENT=ON -D ENABLE_SSL=ON}}
Reporter: Jan Schlicht


Compiling libevent 2.1.5 beta with OpenSSL 1.1.0 fails with errors like
{noformat}
/home/vagrant/mesos/build/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:
 In function ‘bio_bufferevent_new’:
/home/vagrant/mesos/build/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:112:3:
 error: dereferencing pointer to incomplete type ‘BIO’ {aka ‘struct bio_st’}
  b->init = 0;
   ^~
{noformat}

As this is the version currently bundled by CMake, builds with 
{{ENABLE_LIBEVENT=ON, ENABLE_SSL=ON}} will fail to compile.

Libevent supports OpenSSL 1.1.0 beginning with v2.1.7-rc (see 
https://github.com/libevent/libevent/pull/397) 
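
For reference, a hedged illustration of the underlying incompatibility: OpenSSL 
1.1.0 made {{BIO}} opaque, so the direct field access in the bundled libevent 
no longer compiles and has to go through the accessor functions that libevent 
v2.1.7-rc and later use.
{code:cpp}
#include <openssl/bio.h>

// Illustration only, not a patch to the bundled libevent: with OpenSSL 1.1.0
// the BIO struct can no longer be dereferenced directly.
static void markBioUninitialized(BIO* b)
{
  // Pre-1.1.0 style, fails to compile against OpenSSL 1.1.0:
  //   b->init = 0;

  // 1.1.0 accessor, used by libevent >= 2.1.7-rc:
  BIO_set_init(b, 0);
}
{code}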



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8866) CMake builds are missing byproduct declaration for jemalloc.

2018-05-02 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8866:
---

 Summary: CMake builds are missing byproduct declaration for 
jemalloc.
 Key: MESOS-8866
 URL: https://issues.apache.org/jira/browse/MESOS-8866
 Project: Mesos
  Issue Type: Bug
  Components: cmake
 Environment: Cmake with {{-G Ninja}} and {{-D 
ENABLE_JEMALLOC_ALLOCATOR=ON}}.
Reporter: Jan Schlicht
Assignee: Jan Schlicht


The {{jemalloc}} dependency is missing a byproduct declaration in the CMake 
configuration. As a result, building Mesos with {{jemalloc}} enabled using 
CMake and Ninja will fail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7854) Authorize resource calls to provider manager api

2018-04-26 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453760#comment-16453760
 ] 

Jan Schlicht commented on MESOS-7854:
-

Closing this in favor of MESOS-8774, as that ticket is more specific.

> Authorize resource calls to provider manager api
> 
>
> Key: MESOS-7854
> URL: https://issues.apache.org/jira/browse/MESOS-7854
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Bannier
>Priority: Critical
>  Labels: csi-post-mvp, mesosphere, storage
>
> The resource provider manager provides a function
> {code}
> process::Future api(
> const process::http::Request& request,
> const Option& principal) const;
> {code}
> which is exposed e.g., as an agent endpoint.
> We need to add authorization to this function in order to, e.g., stop rogue 
> callers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8774) Authenticate and authorize calls to the resource provider manager's API

2018-04-26 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht reassigned MESOS-8774:
---

Assignee: Jan Schlicht

> Authenticate and authorize calls to the resource provider manager's API 
> 
>
> Key: MESOS-8774
> URL: https://issues.apache.org/jira/browse/MESOS-8774
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Benjamin Bannier
>Assignee: Jan Schlicht
>Priority: Major
>  Labels: mesosphere
>
> The resource provider manager is exposed via an agent endpoint against which 
> resource providers subscribe or perform other actions. We should authenticate 
> and authorize any interactions there.
> Since local resource providers currently run on agents, which manage their 
> lifetime, it seems natural to extend the framework used for executor 
> authentication to resource providers as well. The agent would then generate a 
> secret token whenever a new resource provider is started and inject it into 
> the resource providers it launches. Resource providers in turn would use this 
> token when interacting with the manager API.
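
A minimal sketch of how such a token could be attached to requests against the 
manager API. The header name, scheme, and content types are assumptions, not 
the actual Mesos implementation:
{code:cpp}
#include <map>
#include <string>

// Illustrative only: build the headers a resource provider would send with
// its SUBSCRIBE (and subsequent) calls, carrying the agent-injected secret.
std::map<std::string, std::string> authenticatedHeaders(const std::string& token)
{
  return {
    {"Authorization", "Bearer " + token},
    {"Content-Type", "application/x-protobuf"},
    {"Accept", "application/x-protobuf"},
  };
}
{code}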



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8818) VolumeSandboxPathIsolatorTest.SharedParentTypeVolume fails on macOS

2018-04-23 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447800#comment-16447800
 ] 

Jan Schlicht commented on MESOS-8818:
-

cc [~jpe...@apache.org]

> VolumeSandboxPathIsolatorTest.SharedParentTypeVolume fails on macOS
> ---
>
> Key: MESOS-8818
> URL: https://issues.apache.org/jira/browse/MESOS-8818
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: macOS 10.13.4
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>Priority: Major
>  Labels: mesosphere
>
> This test fails on macOS with:
> {noformat}
> [ RUN  ] VolumeSandboxPathIsolatorTest.SharedParentTypeVolume
> I0423 10:55:19.624977 2767623040 containerizer.cpp:296] Using isolation { 
> environment_secret, filesystem/posix, volume/sandbox_path }
> I0423 10:55:19.625176 2767623040 provisioner.cpp:299] Using default backend 
> 'copy'
> ../../src/tests/containerizer/volume_sandbox_path_isolator_tests.cpp:130: 
> Failure
> create: Unknown or unsupported isolator 'volume/sandbox_path'
> [  FAILED  ] VolumeSandboxPathIsolatorTest.SharedParentTypeVolume (3 ms)
> {noformat}
> Likely a regression introduced in commit 
> {{189efed864ca2455674b0790d6be4a73c820afd6}} which removed 
> {{volume/sandbox_path}} for POSIX.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8818) VolumeSandboxPathIsolatorTest.SharedParentTypeVolume fails on macOS

2018-04-23 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8818:
---

 Summary: VolumeSandboxPathIsolatorTest.SharedParentTypeVolume 
fails on macOS
 Key: MESOS-8818
 URL: https://issues.apache.org/jira/browse/MESOS-8818
 Project: Mesos
  Issue Type: Bug
  Components: containerization
 Environment: macOS 10.13.4
Reporter: Jan Schlicht
Assignee: Jan Schlicht


This test fails on macOS with:
{noformat}
[ RUN  ] VolumeSandboxPathIsolatorTest.SharedParentTypeVolume
I0423 10:55:19.624977 2767623040 containerizer.cpp:296] Using isolation { 
environment_secret, filesystem/posix, volume/sandbox_path }
I0423 10:55:19.625176 2767623040 provisioner.cpp:299] Using default backend 
'copy'
../../src/tests/containerizer/volume_sandbox_path_isolator_tests.cpp:130: 
Failure
create: Unknown or unsupported isolator 'volume/sandbox_path'
[  FAILED  ] VolumeSandboxPathIsolatorTest.SharedParentTypeVolume (3 ms)
{noformat}

Likely a regression introduced in commit 
{{189efed864ca2455674b0790d6be4a73c820afd6}} which removed 
{{volume/sandbox_path}} for POSIX.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8521) Various IOSwitchboard related tests fail on macOS High Sierra.

2018-04-10 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431966#comment-16431966
 ] 

Jan Schlicht commented on MESOS-8521:
-

Can also confirm that I'm no longer getting these failures on 10.13.4 using 
LLVM 6.0.0.

> Various IOSwitchboard related tests fail on macOS High Sierra. 
> ---
>
> Key: MESOS-8521
> URL: https://issues.apache.org/jira/browse/MESOS-8521
> Project: Mesos
>  Issue Type: Bug
> Environment: macOS 10.13.2 (17C88)
> Apple LLVM version 9.0.0 (clang-900.0.39.2)
>Reporter: Till Toenshoff
>Priority: Major
>
> The problem appears to cause several switchboard tests to fail. Note that 
> this problem does not manifest on older Apple systems.
> The failure rate on this system is 100%.
> List of currently failing tests:
> {noformat}
> IOSwitchboardTest.ContainerAttach
> IOSwitchboardTest.ContainerAttachAfterSlaveRestart
> IOSwitchboardTest.OutputRedirectionWithTTY
> ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/1
> {noformat}
> This is an example using {{GLOG=v1}} verbose logging:
> {noformat}
> [ RUN  ] IOSwitchboardTest.ContainerAttach
> I0201 03:02:51.925930 2385417024 containerizer.cpp:304] Using isolation { 
> environment_secret, filesystem/posix, posix/cpu }
> I0201 03:02:51.926230 2385417024 provisioner.cpp:299] Using default backend 
> 'copy'
> I0201 03:02:51.927325 107409408 containerizer.cpp:674] Recovering 
> containerizer
> I0201 03:02:51.928336 109019136 provisioner.cpp:495] Provisioner recovery 
> complete
> I0201 03:02:51.934250 105799680 containerizer.cpp:1202] Starting container 
> 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.936218 105799680 containerizer.cpp:1368] Checkpointed 
> ContainerConfig at 
> '/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/IOSwitchboardTest_ContainerAttach_1nkPYl/containers/1b1af888-9e39-4c13-a647-ac43c0df9fad/config'
> I0201 03:02:51.936251 105799680 containerizer.cpp:2952] Transitioning the 
> state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from PROVISIONING to 
> PREPARING
> I0201 03:02:51.937369 109019136 switchboard.cpp:429] Allocated pseudo 
> terminal '/dev/ttys003' for container 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.943632 109019136 switchboard.cpp:557] Launching 
> 'mesos-io-switchboard' with flags '--heartbeat_interval="30secs" 
> --help="false" 
> --socket_address="/tmp/mesos-io-switchboard-d3bcec3f-7c29-4630-b374-55fabb6034d8"
>  --stderr_from_fd="7" --stderr_to_fd="2" --stdin_to_fd="7" 
> --stdout_from_fd="7" --stdout_to_fd="1" --tty="true" 
> --wait_for_connection="false"' for container 
> 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.945106 109019136 switchboard.cpp:587] Created I/O switchboard 
> server (pid: 83716) listening on socket file 
> '/tmp/mesos-io-switchboard-d3bcec3f-7c29-4630-b374-55fabb6034d8' for 
> container 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.947762 106336256 containerizer.cpp:1844] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> --launch_info="{"command":{"shell":true,"value":"sleep 
> 1000"},"environment":{"variables":[{"name":"MESOS_SANDBOX","type":"VALUE","value":"\/var\/folders\/_t\/rdp354gx7j5fjww270kbk6_rgn\/T\/IOSwitchboardTest_ContainerAttach_W9gDw0"}]},"task_environment":{},"tty_slave_path":"\/dev\/ttys003","working_directory":"\/var\/folders\/_t\/rdp354gx7j5fjww270kbk6_rgn\/T\/IOSwitchboardTest_ContainerAttach_W9gDw0"}"
>  --pipe_read="7" --pipe_write="10" 
> --runtime_directory="/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/IOSwitchboardTest_ContainerAttach_1nkPYl/containers/1b1af888-9e39-4c13-a647-ac43c0df9fad"'
> I0201 03:02:51.949144 106336256 launcher.cpp:140] Forked child with pid 
> '83717' for container '1b1af888-9e39-4c13-a647-ac43c0df9fad'
> I0201 03:02:51.949896 106336256 containerizer.cpp:2952] Transitioning the 
> state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from PREPARING to 
> ISOLATING
> I0201 03:02:51.951071 106336256 containerizer.cpp:2952] Transitioning the 
> state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from ISOLATING to 
> FETCHING
> I0201 03:02:51.951190 108482560 fetcher.cpp:369] Starting to fetch URIs for 
> container: 1b1af888-9e39-4c13-a647-ac43c0df9fad, directory: 
> /var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/IOSwitchboardTest_ContainerAttach_W9gDw0
> I0201 03:02:51.951791 109019136 containerizer.cpp:2952] Transitioning the 
> state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from FETCHING to 
> RUNNING
> I0201 03:02:52.076602 106872832 containerizer.cpp:2338] Destroying container 
> 1b1af888-9e39-4c13-a647-ac43c0df9fad in RUNNING state
> I0201 03:02:52.076644 106872832 containerizer.cpp:2952] Transitioning the 
> state of 

[jira] [Assigned] (MESOS-3858) Draft quota limits design document

2018-03-26 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht reassigned MESOS-3858:
---

Assignee: (was: Jan Schlicht)

> Draft quota limits design document
> --
>
> Key: MESOS-3858
> URL: https://issues.apache.org/jira/browse/MESOS-3858
> Project: Mesos
>  Issue Type: Task
>Reporter: Jan Schlicht
>Priority: Major
>  Labels: mesosphere, quota
>
> In the design documents for Quota 
> (https://docs.google.com/document/d/16iRNmziasEjVOblYp5bbkeBZ7pnjNlaIzPQqMTHQ-9I/edit#)
>  the proposed MVP does not include quota limits. Quota limits represent an 
> upper bound of resources that a role is allowed to use. The task of this 
> ticket is to outline a design document on how to implement quota limits when 
> the quota MVP is implemented.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8720) CSIClientTest segfaults on macOS.

2018-03-22 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8720:
---

 Summary: CSIClientTest segfaults on macOS.
 Key: MESOS-8720
 URL: https://issues.apache.org/jira/browse/MESOS-8720
 Project: Mesos
  Issue Type: Bug
  Components: storage
Affects Versions: 1.6.0
 Environment: macOS 10.13.3, LLVM 6.0.0
Reporter: Jan Schlicht


This seems to be caused by the changes introduced in commit 
{{79c21981803dafd8a5e971b98961487a69017ce9}}. On a macOS build, configured with 
{{--enable-grpc}}, all test cases in {{CSIClientTest}} segfault. Running 
{{src/mesos-tests --gtest_filter=\*CSIClientTest\*}} results in
{noformat}
[ RUN  ] Identity/CSIClientTest.Call/Client_GetSupportedVersions
mesos-tests(57309,0x7fffa0293340) malloc: *** error for object 0x10bb63b68: 
pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
*** Aborted at 1521711802 (unix time) try "date -d @1521711802" if you are 
using GNU date ***
PC: @ 0x7fff6738ce3e __pthread_kill
*** SIGABRT (@0x7fff6738ce3e) received by PID 57309 (TID 0x7fffa0293340) stack 
trace: ***
@ 0x7fff674bef5a _sigtramp
@0x0 (unknown)
@ 0x7fff672e9312 abort
@ 0x7fff673e6866 free
@0x10aec51bd grpc::CompletionQueue::CompletionQueue()
@0x10b2087a4 process::grpc::client::Runtime::Data::Data()
@0x107bd697d mesos::internal::tests::CSIClientTest::CSIClientTest()
@0x107bd68ca 
testing::internal::ParameterizedTestFactory<>::CreateTest()
@0x107c58158 
testing::internal::HandleExceptionsInMethodIfSupported<>()
@0x107c57fd8 testing::TestInfo::Run()
@0x107c588c7 testing::TestCase::Run()
@0x107c612b7 testing::internal::UnitTestImpl::RunAllTests()
@0x107c60d58 
testing::internal::HandleExceptionsInMethodIfSupported<>()
@0x107c60cc8 testing::UnitTest::Run()
@0x106afc83d main
@ 0x7fff6723d115 start
@0x2 (unknown)
Abort trap: 6
{noformat}

Increasing GLog verbosity doesn't provide more information.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8719) Mesos compiled with `--enable-grpc` doesn't compile on non-Linux builds

2018-03-22 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8719:
---

 Summary: Mesos compiled with `--enable-grpc` doesn't compile on 
non-Linux builds
 Key: MESOS-8719
 URL: https://issues.apache.org/jira/browse/MESOS-8719
 Project: Mesos
  Issue Type: Bug
  Components: storage
Affects Versions: 1.6.0
 Environment: macOS
Reporter: Jan Schlicht
Assignee: Jan Schlicht


Commit {{59cca968e04dee069e0df2663733b6d6f55af0da}} added 
{{examples/test_csi_plugin.cpp}} to non-Linux builds that are configured using 
the {{--enable-grpc}} flag. As {{examples/test_csi_plugin.cpp}} includes 
{{fs/linux.hpp}}, it can only compile on Linux and needs to be disabled for 
non-Linux builds.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8677) FaultToleranceTest.ReregisterCompletedFrameworks crashes on macOS

2018-03-15 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8677:
---

 Summary: FaultToleranceTest.ReregisterCompletedFrameworks crashes 
on macOS
 Key: MESOS-8677
 URL: https://issues.apache.org/jira/browse/MESOS-8677
 Project: Mesos
  Issue Type: Bug
  Components: test
 Environment: macOS 10.13.3 with LLVM 6.0.0 as well as with Apple LLVM 
version 9.0.0 (clang-900.0.39.2)
Reporter: Jan Schlicht


Here's a {{GLOG_v=1}} run of the test:
{noformat}
[ RUN  ] FaultToleranceTest.ReregisterCompletedFrameworks
I0314 14:30:11.240077 2290090816 cluster.cpp:172] Creating default 'local' 
authorizer
I0314 14:30:11.241261 55140352 master.cpp:463] Master 
025f775d-9c75-43f6-9ee6-079a605fbf01 (Jenkinss-Mac-mini.local) started on 
10.0.49.4:54648
I0314 14:30:11.241287 55140352 master.cpp:465] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" 
--credentials="/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/ZyMWb1/credentials"
 --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --root_submissions="true" --user_sorter="drf" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/ZyMWb1/master"
 --zk_session_timeout="10secs"
I0314 14:30:11.241439 55140352 master.cpp:514] Master only allowing 
authenticated frameworks to register
I0314 14:30:11.241447 55140352 master.cpp:520] Master only allowing 
authenticated agents to register
I0314 14:30:11.241452 55140352 master.cpp:526] Master only allowing 
authenticated HTTP frameworks to register
I0314 14:30:11.241461 55140352 credentials.hpp:37] Loading credentials for 
authentication from 
'/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/ZyMWb1/credentials'
I0314 14:30:11.241678 55140352 master.cpp:570] Using default 'crammd5' 
authenticator
I0314 14:30:11.241739 55140352 http.cpp:957] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0314 14:30:11.241824 55140352 http.cpp:957] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0314 14:30:11.241873 55140352 http.cpp:957] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0314 14:30:11.241919 55140352 master.cpp:649] Authorization enabled
I0314 14:30:11.242066 52457472 whitelist_watcher.cpp:77] No whitelist given
I0314 14:30:11.242079 51920896 hierarchical.cpp:175] Initialized hierarchical 
allocator process
I0314 14:30:11.243557 52994048 master.cpp:2119] Elected as the leading master!
I0314 14:30:11.243574 52994048 master.cpp:1678] Recovering from registrar
I0314 14:30:11.243640 51920896 registrar.cpp:347] Recovering registrar
I0314 14:30:11.243852 52457472 registrar.cpp:391] Successfully fetched the 
registry (0B) in 190976ns
I0314 14:30:11.243928 52457472 registrar.cpp:495] Applied 1 operations in 
28606ns; attempting to update the registry
I0314 14:30:11.244163 52457472 registrar.cpp:552] Successfully updated the 
registry in 194816ns
I0314 14:30:11.244222 52457472 registrar.cpp:424] Successfully recovered 
registrar
I0314 14:30:11.244408 54067200 master.cpp:1792] Recovered 0 agents from the 
registry (155B); allowing 10mins for agents to reregister
I0314 14:30:11.23 52994048 hierarchical.cpp:213] Skipping recovery of 
hierarchical allocator: nothing to recover
W0314 14:30:11.247259 2290090816 process.cpp:2805] Attempted to spawn already 
running process files@10.0.49.4:54648
I0314 14:30:11.247681 2290090816 cluster.cpp:460] Creating default 'local' 
authorizer
I0314 14:30:11.248837 55676928 slave.cpp:265] Mesos agent started on 
(50)@10.0.49.4:54648
I0314 14:30:11.248865 55676928 slave.cpp:266] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://; 
--appc_store_dir="/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/FaultToleranceTest_ReregisterCompletedFrameworks_UqvwBG/store/appc"
 --authenticate_http_executors="true" 

[jira] [Created] (MESOS-8610) NsTest.SupportedNamespaces fails on CentOS7

2018-02-26 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8610:
---

 Summary: NsTest.SupportedNamespaces fails on CentOS7
 Key: MESOS-8610
 URL: https://issues.apache.org/jira/browse/MESOS-8610
 Project: Mesos
  Issue Type: Bug
Reporter: Jan Schlicht


Failed on a {{GLOG_v=1 src/mesos-tests --verbose}} run with
{noformat}
[ RUN  ] NsTest.SupportedNamespaces
../../src/tests/containerizer/ns_tests.cpp:119: Failure
Value of: (ns::supported(n)).get()
  Actual: false
Expected: true
Which is: true
CLONE_NEWUSER
../../src/tests/containerizer/ns_tests.cpp:124: Failure
Value of: (ns::supported(allNamespaces)).get()
  Actual: false
Expected: true
Which is: true
CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUSER
[  FAILED  ] NsTest.SupportedNamespaces (0 ms)
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8603) SlaveTest.TerminalTaskContainerizerUpdateFailsWithGone and SlaveTest.TerminalTaskContainerizerUpdateFailsWithLost are flaky

2018-02-23 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8603:
---

 Summary: SlaveTest.TerminalTaskContainerizerUpdateFailsWithGone 
and SlaveTest.TerminalTaskContainerizerUpdateFailsWithLost are flaky
 Key: MESOS-8603
 URL: https://issues.apache.org/jira/browse/MESOS-8603
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Jan Schlicht
 Attachments: TerminalTaskContainerizerUpdateFailsWithGone, 
TerminalTaskContainerizerUpdateFailsWithLost

Both tests fail from time to time. The verbose test output of the failures is attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8593) Support credential updates in Docker config without restarting the agent

2018-02-19 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8593:
---

 Summary: Support credential updates in Docker config without 
restarting the agent
 Key: MESOS-8593
 URL: https://issues.apache.org/jira/browse/MESOS-8593
 Project: Mesos
  Issue Type: Improvement
  Components: containerization, docker
Reporter: Jan Schlicht


When using the Mesos containerizer with a private Docker repository and the 
{{--docker_config}} option, the repository might expire credentials after some 
time, forcing the user to log in again. In that case the Docker config in use 
changes and the agent needs to be restarted to pick up the change. Instead of 
requiring a restart, the agent could reload the Docker config file every time 
before fetching.
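
A minimal sketch of that idea, with hypothetical names ({{loadDockerConfig}}, 
{{fetchImage}}) rather than the actual Mesos fetcher code, re-reading the config 
file on every fetch:
{code}
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: re-read the Docker config from disk on every call so
// that credentials refreshed by a later `docker login` are picked up.
std::string loadDockerConfig(const std::string& path)
{
  std::ifstream file(path);
  std::ostringstream contents;
  contents << file.rdbuf();
  return contents.str();
}

// Hypothetical fetch entry point: reload credentials right before fetching
// instead of using a config that was cached once at agent startup.
void fetchImage(const std::string& image, const std::string& dockerConfigPath)
{
  const std::string config = loadDockerConfig(dockerConfigPath);
  // ... pass `config` to the registry client performing the pull ...
  (void)image;
}
{code}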



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8585) Agent Crashes When Ask to Start Task with Unknown User

2018-02-15 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365343#comment-16365343
 ] 

Jan Schlicht commented on MESOS-8585:
-

Looks like this has been introduced in https://reviews.apache.org/r/64630/.
cc [~jpe...@apache.org]
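
For illustration only (not the actual patch), a sketch of how the launch path 
could check that the user exists and fail the task instead of hitting the fatal 
{{CHECK_SOME(mkdir)}} seen in the log below; {{userExists}} is a hypothetical 
helper:
{code}
#include <pwd.h>

#include <string>

// Hypothetical helper: returns whether a local user with the given name
// exists (getpwnam returns nullptr for unknown users).
bool userExists(const std::string& user)
{
  return ::getpwnam(user.c_str()) != nullptr;
}

// Sketch of the intended behavior in the launch path:
//
//   if (!userExists(task.user())) {
//     // Reject the launch with TASK_FAILED and a reason instead of letting
//     // paths::createExecutorDirectory() CHECK-fail on the chown error.
//   }
{code}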

> Agent Crashes When Ask to Start Task with Unknown User
> --
>
> Key: MESOS-8585
> URL: https://issues.apache.org/jira/browse/MESOS-8585
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.5.0
>Reporter: Karsten
>Priority: Major
> Attachments: dcos-mesos-slave.service.1.gz, 
> dcos-mesos-slave.service.2.gz
>
>
> The Marathon team has an integration test that tries to start a task with an 
> unknown user. The test expects a {{TASK_FAILED}}. However, we see 
> {{TASK_DROPPED}} instead. The agent logs seem to suggest that the agent 
> crashes and restarts.
>  
> {code}
>  783 2018-02-14 14:55:45: I0214 14:55:45.319974  6213 slave.cpp:2542] 
> Launching task 'sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6' for 
> framework 120721e5-96e5-4c0b-8660-d5ba2e96f05a-0001
> 784 2018-02-14 14:55:45: I0214 14:55:45.320605  6213 paths.cpp:727] 
> Creating sandbox 
> '/var/lib/mesos/slave/slaves/120721e5-96e5-4c0b-8660-d5ba2e96f05a-S3/frameworks/120721e5-96e5-4c0b-8660-d5ba2e96f05
> 784 
> a-0001/executors/sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6/runs/dc99056a-1d85-427f-a34b-ac666d4acc88'
>  for user 'bad'
> 785 2018-02-14 14:55:45: F0214 14:55:45.321131  6213 paths.cpp:735] 
> CHECK_SOME(mkdir): Failed to chown directory to 'bad': No such user 'bad' 
> Failed to create executor directory '/var/lib/mesos/slave/
> 785 
> slaves/120721e5-96e5-4c0b-8660-d5ba2e96f05a-S3/frameworks/120721e5-96e5-4c0b-8660-d5ba2e96f05a-0001/executors/sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6/runs/dc99056a-1d85-427f-a34b-ac6
> 785 66d4acc88'
> 786 2018-02-14 14:55:45: *** Check failure stack trace: ***
> 787 2018-02-14 14:55:45: @ 0x7f72033444ad  
> google::LogMessage::Fail()
> 788 2018-02-14 14:55:45: @ 0x7f72033462dd  
> google::LogMessage::SendToLog()
> 789 2018-02-14 14:55:45: @ 0x7f720334409c  
> google::LogMessage::Flush()
> 790 2018-02-14 14:55:45: @ 0x7f7203346bd9  
> google::LogMessageFatal::~LogMessageFatal()
> 791 2018-02-14 14:55:45: @ 0x56544ca378f9  
> _CheckFatal::~_CheckFatal()
> 792 2018-02-14 14:55:45: @ 0x7f720270f30d  
> mesos::internal::slave::paths::createExecutorDirectory()
> 793 2018-02-14 14:55:45: @ 0x7f720273812c  
> mesos::internal::slave::Framework::addExecutor()
> 794 2018-02-14 14:55:45: @ 0x7f7202753e35  
> mesos::internal::slave::Slave::__run()
> 795 2018-02-14 14:55:45: @ 0x7f7202764292  
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNS1_6FutureISt4
> 795 
> listIbSaIbRKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionINSA_8TaskInfoEERKSR_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaIS11_EESK_SN_SQ_SV_SZ_S15_EEvRKNS1_3PIDIT_EEMS1
> 795 
> 7_FvT0_T1_T2_T3_T4_T5_EOT6_OT7_OT8_OT9_OT10_OT11_EUlOSI_OSL_OSO_OST_OSX_OS13_S3_E_ISI_SL_SO_ST_SX_S13_St12_PlaceholderILi1EEclEOS3_
> 796 2018-02-14 14:55:45: @ 0x7f72032a2b11  
> process::ProcessBase::consume()
> 797 2018-02-14 14:55:45: @ 0x7f72032b183c  
> process::ProcessManager::resume()
> 798 2018-02-14 14:55:45: @ 0x7f72032b6da6  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> 799 2018-02-14 14:55:45: @ 0x7f72005ced73  (unknown)
> 800 2018-02-14 14:55:45: @ 0x7f72000cf52c  (unknown)
> 801 2018-02-14 14:55:45: @ 0x7f71ffe0d1dd  (unknown)
> 802 2018-02-14 14:57:15: dcos-mesos-slave.service: Main process exited, 
> code=killed, status=6/ABRT
> 803 2018-02-14 14:57:15: dcos-mesos-slave.service: Unit entered failed 
> state.
> 804 2018-02-14 14:57:15: dcos-mesos-slave.service: Failed with result 
> 'signal'.
> 805 2018-02-14 14:57:20: dcos-mesos-slave.service: Service hold-off time 
> over, scheduling restart.
> 806 2018-02-14 14:57:20: Stopped Mesos Agent: distributed systems kernel 
> agent.
> 807 2018-02-14 14:57:20: Starting Mesos Agent: distributed systems kernel 
> agent...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8424) Test that operations are correctly reported following a master failover

2018-02-01 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348677#comment-16348677
 ] 

Jan Schlicht commented on MESOS-8424:
-

Only 65043 has been merged; the others are still in review. Reopening.

> Test that operations are correctly reported following a master failover
> ---
>
> Key: MESOS-8424
> URL: https://issues.apache.org/jira/browse/MESOS-8424
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>Priority: Major
>  Labels: mesosphere
> Fix For: 1.6.0
>
>
> As the master keeps track of operations running on a resource provider, it 
> needs to be updated on these operations when agents reregister after a master 
> failover. E.g., an operation that has finished during the failover should be 
> reported as finished by the master after the agent on which the resource 
> provider is running has reregistered.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8524) When `UPDATE_SLAVE` messages are received, offers might not be rescinded due to a race

2018-02-01 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-8524:

Summary: When `UPDATE_SLAVE` messages are received, offers might not be 
rescinded due to a race   (was: When `UPDATE_SLAVE` messages are received, 
offers might not be recinded due to a race )

> When `UPDATE_SLAVE` messages are received, offers might not be rescinded due 
> to a race 
> ---
>
> Key: MESOS-8524
> URL: https://issues.apache.org/jira/browse/MESOS-8524
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, master
>Affects Versions: 1.5.0
> Environment: Master + Agent running with enabled 
> {{RESOURCE_PROVIDER}} capability
>Reporter: Jan Schlicht
>Priority: Major
>  Labels: mesosphere
>
> When an agent with the {{RESOURCE_PROVIDER}} capability enabled (re-)registers 
> with the master, it sends an {{UPDATE_SLAVE}} message after being 
> (re-)registered. In the master, the agent is added (back) to the allocator as 
> soon as it's (re-)registered, i.e. before {{UPDATE_SLAVE}} is sent. This 
> triggers an allocation and offers might get sent out to frameworks. When 
> {{UPDATE_SLAVE}} is handled in the master, these offers have to be rescinded, 
> as they're based on an outdated agent state.
> Internally, the allocator defers an offer callback in the master 
> ({{Master::offer}}). In rare cases an {{UPDATE_SLAVE}} message might arrive at 
> the same time and its handler in the master is called before the offer 
> callback (but after the actual allocation took place). In this case the 
> (outdated) offer is still sent to frameworks and never rescinded.
> Here's the relevant log lines, this was discovered while working on 
> https://reviews.apache.org/r/65045/:
> {noformat}
> I0201 14:17:47.041093 242208768 hierarchical.cpp:1517] Performed allocation 
> for 1 agents in 704915ns
> I0201 14:17:47.041738 242745344 master.cpp:7235] Received update of agent 
> 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 at slave(540)@172.18.8.20:60469 
> (172.18.8.20) with total oversubscribed resources {}
> I0201 14:17:47.042778 242745344 master.cpp:8808] Sending 1 offers to 
> framework 53c557e7-3161-449b-bacc-a4f8c02e78e7- (default) at 
> scheduler-798f476b-b099-443e-bd3b-9e7333f29672@172.18.8.20:60469
> I0201 14:17:47.043102 243281920 sched.cpp:921] Scheduler::resourceOffers took 
> 40444ns
> I0201 14:17:47.043427 243818496 hierarchical.cpp:712] Grew agent 
> 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 by disk[MOUNT]:200 (total), {  } 
> (used)
> I0201 14:17:47.043643 243818496 hierarchical.cpp:669] Agent 
> 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 (172.18.8.20) updated with total 
> resources disk[MOUNT]:200; cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8524) When `UPDATE_SLAVE` messages are received, offers might not be recinded due to a race

2018-02-01 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8524:
---

 Summary: When `UPDATE_SLAVE` messages are received, offers might 
not be recinded due to a race 
 Key: MESOS-8524
 URL: https://issues.apache.org/jira/browse/MESOS-8524
 Project: Mesos
  Issue Type: Bug
  Components: allocation, master
Affects Versions: 1.5.0
 Environment: Master + Agent running with enabled {{RESOURCE_PROVIDER}} 
capability
Reporter: Jan Schlicht


When an agent with the {{RESOURCE_PROVIDER}} capability enabled (re-)registers 
with the master, it sends an {{UPDATE_SLAVE}} message after being 
(re-)registered. In the master, the agent is added (back) to the allocator as 
soon as it's (re-)registered, i.e. before {{UPDATE_SLAVE}} is sent. This 
triggers an allocation and offers might get sent out to frameworks. When 
{{UPDATE_SLAVE}} is handled in the master, these offers have to be rescinded, 
as they're based on an outdated agent state.
Internally, the allocator defers an offer callback in the master 
({{Master::offer}}). In rare cases an {{UPDATE_SLAVE}} message might arrive at 
the same time and its handler in the master is called before the offer callback 
(but after the actual allocation took place). In this case the (outdated) offer 
is still sent to frameworks and never rescinded.

Here's the relevant log lines, this was discovered while working on 
https://reviews.apache.org/r/65045/:
{noformat}
I0201 14:17:47.041093 242208768 hierarchical.cpp:1517] Performed allocation for 
1 agents in 704915ns
I0201 14:17:47.041738 242745344 master.cpp:7235] Received update of agent 
53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 at slave(540)@172.18.8.20:60469 
(172.18.8.20) with total oversubscribed resources {}
I0201 14:17:47.042778 242745344 master.cpp:8808] Sending 1 offers to framework 
53c557e7-3161-449b-bacc-a4f8c02e78e7- (default) at 
scheduler-798f476b-b099-443e-bd3b-9e7333f29672@172.18.8.20:60469
I0201 14:17:47.043102 243281920 sched.cpp:921] Scheduler::resourceOffers took 
40444ns
I0201 14:17:47.043427 243818496 hierarchical.cpp:712] Grew agent 
53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 by disk[MOUNT]:200 (total), {  } (used)
I0201 14:17:47.043643 243818496 hierarchical.cpp:669] Agent 
53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 (172.18.8.20) updated with total 
resources disk[MOUNT]:200; cpus:2; mem:1024; disk:1024; ports:[31000-32000]
{noformat}
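
A self-contained sketch of the idea behind a fix, with illustrative types and 
names (not the actual {{Master}} code): rescind every offer still outstanding 
for the agent before its {{UPDATE_SLAVE}} state is applied.
{code}
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

struct Offer
{
  std::string id;
  std::string agentId;
};

// Rescind (and forget) all offers that were created from the agent's
// pre-UPDATE_SLAVE resources, including offers whose delivery callback was
// already dispatched but not yet run.
void rescindOutstandingOffers(
    const std::string& agentId,
    std::unordered_map<std::string, Offer>& outstanding,  // offerId -> Offer
    const std::function<void(const Offer&)>& rescind)
{
  std::vector<std::string> rescinded;

  for (const auto& entry : outstanding) {
    if (entry.second.agentId == agentId) {
      rescind(entry.second);  // Notify the framework that the offer is gone.
      rescinded.push_back(entry.first);
    }
  }

  for (const std::string& offerId : rescinded) {
    outstanding.erase(offerId);
  }
}

// Intended call site: at the start of the UPDATE_SLAVE handler, before the
// allocator is updated with the agent's new total resources.
{code}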



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8490) UpdateSlaveMessageWithPendingOffers is flaky.

2018-01-26 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht reassigned MESOS-8490:
---

Assignee: Jan Schlicht  (was: Benjamin Bannier)

> UpdateSlaveMessageWithPendingOffers is flaky.
> -
>
> Key: MESOS-8490
> URL: https://issues.apache.org/jira/browse/MESOS-8490
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: CentOS 6 with SSL
> Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Jan Schlicht
>Priority: Major
>  Labels: flaky-test
> Attachments: UpdateSlaveMessageWithPendingOffers-badrun1.txt, 
> UpdateSlaveMessageWithPendingOffers-badrun2.txt
>
>
> {noformat}
> ../../src/tests/master_tests.cpp:8728
> Failed to wait 15secs for offers
> {noformat}
> Full logs are attached. The log output from the two failures looks different, 
> which might indicate multiple issues.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8473) Authorize `GET_OPERATIONS` calls.

2018-01-22 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8473:
---

 Summary: Authorize `GET_OPERATIONS` calls.
 Key: MESOS-8473
 URL: https://issues.apache.org/jira/browse/MESOS-8473
 Project: Mesos
  Issue Type: Task
  Components: agent, master
Reporter: Jan Schlicht


The {{GET_OPERATIONS}} call lists all known operations on a master or agent. 
Authorization has to be added to this call.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8445) Test that `UPDATE_STATE` of a resource provider doesn't have unwanted side-effects in master or agent

2018-01-15 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8445:
---

 Summary: Test that `UPDATE_STATE` of a resource provider doesn't 
have unwanted side-effects in master or agent
 Key: MESOS-8445
 URL: https://issues.apache.org/jira/browse/MESOS-8445
 Project: Mesos
  Issue Type: Task
Reporter: Jan Schlicht
Assignee: Jan Schlicht


While we test the correct behavior of {{UPDATE_STATE}} sent by resource 
providers when an operation state changes or after (re-)registration, this call 
might also be sent independently of any such event, e.g., if resources are 
added to a running resource provider. The correct behavior of master and agent 
needs to be tested in that case as well: outstanding offers should be rescinded 
and internal states updated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8424) Test that operations are correctly reported following a master failover

2018-01-10 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8424:
---

 Summary: Test that operations are correctly reported following a 
master failover
 Key: MESOS-8424
 URL: https://issues.apache.org/jira/browse/MESOS-8424
 Project: Mesos
  Issue Type: Task
  Components: master
Reporter: Jan Schlicht
Assignee: Jan Schlicht


As the master keeps track of operations running on a resource provider, it 
needs to be updated on these operations when agents reregister after a master 
failover. E.g., an operation that has finished during the failover should be 
reported as finished by the master after the agent on which the resource 
provider is running has reregistered.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8424) Test that operations are correctly reported following a master failover

2018-01-10 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-8424:

  Sprint: Mesosphere Sprint 72
Story Points: 3

> Test that operations are correctly reported following a master failover
> ---
>
> Key: MESOS-8424
> URL: https://issues.apache.org/jira/browse/MESOS-8424
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>
> As the master keeps track of operations running on a resource provider, it 
> needs to be updated on these operations when agents reregister after a master 
> failover. E.g., an operation that has finished during the failover should be 
> reported as finished by the master after the agent on which the resource 
> provider is running has reregistered.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8219) Validate that any offer operation is only applied on resources from a single provider

2018-01-02 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307956#comment-16307956
 ] 

Jan Schlicht commented on MESOS-8219:
-

Sure, will work on this.

> Validate that any offer operation is only applied on resources from a single 
> provider
> -
>
> Key: MESOS-8219
> URL: https://issues.apache.org/jira/browse/MESOS-8219
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Benjamin Bannier
>Assignee: Jan Schlicht
>
> Offer operations can only be applied to resources from one single resource 
> provider. A number of places in the implementation assume that the provider 
> ID obtained from any {{Resource}} in an offer operation is equivalent to the 
> one from any other resource. We should update the master to validate that 
> invariant and reject malformed operations.
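
A self-contained sketch of the invariant check described above (types and names 
are illustrative, not the actual master validation code): all resources in an 
operation must either reference the same provider ID or none at all.
{code}
#include <string>
#include <vector>

// Stand-in for the relevant part of a resource: an optional provider ID.
struct Resource
{
  bool hasProviderId;
  std::string providerId;  // Only meaningful if hasProviderId is true.
};

// Returns true if every resource agrees on a single provider (or all of them
// are agent-local, i.e. have no provider ID); false means the operation is
// malformed and should be rejected.
bool usesSingleProvider(const std::vector<Resource>& resources)
{
  if (resources.empty()) {
    return true;
  }

  const bool expectProvider = resources.front().hasProviderId;
  const std::string expectedId = resources.front().providerId;

  for (const Resource& resource : resources) {
    if (resource.hasProviderId != expectProvider ||
        (resource.hasProviderId && resource.providerId != expectedId)) {
      return false;
    }
  }

  return true;
}
{code}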



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8346) Resubscription of a resource provider will crash the agent if its HTTP connection isn't closed

2017-12-20 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-8346:

Shepherd: Benjamin Bannier

> Resubscription of a resource provider will crash the agent if its HTTP 
> connection isn't closed
> --
>
> Key: MESOS-8346
> URL: https://issues.apache.org/jira/browse/MESOS-8346
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>Priority: Blocker
>  Labels: mesosphere
>
> A resource provider might resubscribe while its old HTTP connection wasn't 
> properly closed. In that case an agent will crash with, e.g., the following 
> log:
> {noformat}
> I1219 13:33:51.937295 128610304 manager.cpp:570] Subscribing resource 
> provider 
> {"id":{"value":"8e71beef-796e-4bde-9257-952ed0f230a5"},"name":"test","type":"org.apache.mesos.rp.test"}
> I1219 13:33:51.937443 128610304 manager.cpp:134] Terminating resource 
> provider 8e71beef-796e-4bde-9257-952ed0f230a5
> I1219 13:33:51.937760 128610304 manager.cpp:134] Terminating resource 
> provider 8e71beef-796e-4bde-9257-952ed0f230a5
> E1219 13:33:51.937851 129683456 http_connection.hpp:445] End-Of-File received
> I1219 13:33:51.937865 131293184 slave.cpp:7105] Handling resource provider 
> message 'DISCONNECT: resource provider 8e71beef-796e-4bde-9257-952ed0f230a5'
> I1219 13:33:51.937968 131293184 slave.cpp:7347] Forwarding new total 
> resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> F1219 13:33:51.938052 132366336 manager.cpp:606] Check failed: 
> resourceProviders.subscribed.contains(resourceProviderId) 
> *** Check failure stack trace: ***
> E1219 13:33:51.938583 130756608 http_connection.hpp:445] End-Of-File received
> I1219 13:33:51.938987 129683456 hierarchical.cpp:669] Agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 (172.18.8.13) updated with total 
> resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> @0x1125380ef  google::LogMessageFatal::~LogMessageFatal()
> @0x112534ae9  google::LogMessageFatal::~LogMessageFatal()
> I1219 13:33:51.939131 129683456 hierarchical.cpp:1517] Performed allocation 
> for 1 agents in 61830ns
> I1219 13:33:51.945793 2646795072 slave.cpp:927] Agent terminating
> I1219 13:33:51.945955 129146880 master.cpp:1305] Agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 
> (172.18.8.13) disconnected
> I1219 13:33:51.945979 129146880 master.cpp:3364] Disconnecting agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 
> (172.18.8.13)
> I1219 13:33:51.946022 129146880 master.cpp:3383] Deactivating agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 
> (172.18.8.13)
> I1219 13:33:51.946081 131293184 hierarchical.cpp:766] Agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 deactivated
> @0x115f2761d  
> mesos::internal::ResourceProviderManagerProcess::subscribe()::$_2::operator()()
> @0x115f2977d  
> _ZN5cpp176invokeIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS2_14HttpConnectionERKNS1_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSG_DpOSH_
> @0x115f29740  
> _ZN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7Nothing13invoke_expandISC_NSt3__15tupleIJSG_EEENSK_IJEEEJLm0DTclsr5cpp17E6invokeclsr3stdE7forwardIT_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardIT0_Efp0_EEclsr3stdE7forwardIT1_Efp2_OSN_OSO_N5cpp1416integer_sequenceImJXspT2_OSP_
> @0x115f296bb  
> _ZNO6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingclIJEEEDTcl13invoke_expandclL_ZNSt3__14moveIRSC_EEONSJ_16remove_referenceIT_E4typeEOSN_EdtdefpT1fEclL_ZNSK_IRNSJ_5tupleIJSG_ESQ_SR_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_DpOSY_
> @0x115f2965d  
> _ZN5cpp176invokeIN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS5_14HttpConnectionERKNS4_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEJEEEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSK_DpOSL_
> @0x115f29631  
> _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS6_14HttpConnectionERKNS5_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEJEEEvOT_DpOT0_
> @

[jira] [Commented] (MESOS-8349) When a resource provider driver is disconnected, it fails to reconnect.

2017-12-20 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298557#comment-16298557
 ] 

Jan Schlicht commented on MESOS-8349:
-

Discarding a {{Future}} (instead of discarding its {{Promise}}) won't call 
{{onAny}} callbacks, only an {{onDiscarded}} callback, which we haven't set up 
here.
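
A small sketch of that distinction, using libprocess {{Future}}/{{Promise}}; 
the exact callback semantics are taken from the comment above and should be 
treated as an assumption:
{code}
#include <process/future.hpp>

#include <stout/nothing.hpp>

using process::Future;
using process::Promise;

void example()
{
  Promise<Nothing> promise;
  Future<Nothing> future = promise.future();

  future.onAny([](const Future<Nothing>&) {
    // Runs once the future transitions to READY, FAILED, or DISCARDED.
  });

  // future.discard();  // Per the comment above, this alone does not
                        // complete the future, so onAny is not invoked.

  promise.discard();    // Completes the future as DISCARDED; onAny fires.
}
{code}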

> When a resource provider driver is disconnected, it fails to reconnect.
> ---
>
> Key: MESOS-8349
> URL: https://issues.apache.org/jira/browse/MESOS-8349
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> If the resource provider manager closes the HTTP connection of a resource 
> provider, the resource provider should reconnect itself. For that, the 
> resource provider driver will change its state to "DISCONNECTED", call a 
> {{disconnected}} callback and use its endpoint detector to reconnect.
> This doesn't work in a testing environment where a 
> {{ConstantEndpointDetector}} is used. While the resource provider is notified 
> of the closed HTTP connection (and logs {{End-Of-File received}}), it never 
> disconnects itself or calls the {{disconnected}} callback. Discarding 
> {{HttpConnectionProcess::detection}} in 
> {{HttpConnectionProcess::disconnected}} doesn't trigger the {{onAny}} 
> callback of that future. This might not be a problem in 
> {{HttpConnectionProcess}} but could be related to the test case using a 
> {{ConstantEndpointDetector}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8349) When a resource provider driver is disconnected, it fails to reconnect.

2017-12-20 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8349:
---

 Summary: When a resource provider driver is disconnected, it fails 
to reconnect.
 Key: MESOS-8349
 URL: https://issues.apache.org/jira/browse/MESOS-8349
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Jan Schlicht
Assignee: Jan Schlicht


If the resource provider manager closes the HTTP connection of a resource 
provider, the resource provider should reconnect itself. For that, the resource 
provider driver will change its state to "DISCONNECTED", call a 
{{disconnected}} callback and use its endpoint detector to reconnect.
This doesn't work in a testing environment where a {{ConstantEndpointDetector}} 
is used. While the resource provider is notified of the closed HTTP connection 
(and logs {{End-Of-File received}}), it never disconnects itself or calls the 
{{disconnected}} callback. Discarding {{HttpConnectionProcess::detection}} in 
{{HttpConnectionProcess::disconnected}} doesn't trigger the {{onAny}} callback 
of that future. This might not be a problem in {{HttpConnectionProcess}} but 
could be related to the test case using a {{ConstantEndpointDetector}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8346) Resubscription of a resource provider will crash the agent if its HTTP connection isn't closed

2017-12-20 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298138#comment-16298138
 ] 

Jan Schlicht commented on MESOS-8346:
-

It will land today; the patch seems good and just needs a small update.

> Resubscription of a resource provider will crash the agent if its HTTP 
> connection isn't closed
> --
>
> Key: MESOS-8346
> URL: https://issues.apache.org/jira/browse/MESOS-8346
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>Priority: Blocker
>  Labels: mesosphere
>
> A resource provider might resubscribe while its old HTTP connection wasn't 
> properly closed. In that case an agent will crash with, e.g., the following 
> log:
> {noformat}
> I1219 13:33:51.937295 128610304 manager.cpp:570] Subscribing resource 
> provider 
> {"id":{"value":"8e71beef-796e-4bde-9257-952ed0f230a5"},"name":"test","type":"org.apache.mesos.rp.test"}
> I1219 13:33:51.937443 128610304 manager.cpp:134] Terminating resource 
> provider 8e71beef-796e-4bde-9257-952ed0f230a5
> I1219 13:33:51.937760 128610304 manager.cpp:134] Terminating resource 
> provider 8e71beef-796e-4bde-9257-952ed0f230a5
> E1219 13:33:51.937851 129683456 http_connection.hpp:445] End-Of-File received
> I1219 13:33:51.937865 131293184 slave.cpp:7105] Handling resource provider 
> message 'DISCONNECT: resource provider 8e71beef-796e-4bde-9257-952ed0f230a5'
> I1219 13:33:51.937968 131293184 slave.cpp:7347] Forwarding new total 
> resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> F1219 13:33:51.938052 132366336 manager.cpp:606] Check failed: 
> resourceProviders.subscribed.contains(resourceProviderId) 
> *** Check failure stack trace: ***
> E1219 13:33:51.938583 130756608 http_connection.hpp:445] End-Of-File received
> I1219 13:33:51.938987 129683456 hierarchical.cpp:669] Agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 (172.18.8.13) updated with total 
> resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> @0x1125380ef  google::LogMessageFatal::~LogMessageFatal()
> @0x112534ae9  google::LogMessageFatal::~LogMessageFatal()
> I1219 13:33:51.939131 129683456 hierarchical.cpp:1517] Performed allocation 
> for 1 agents in 61830ns
> I1219 13:33:51.945793 2646795072 slave.cpp:927] Agent terminating
> I1219 13:33:51.945955 129146880 master.cpp:1305] Agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 
> (172.18.8.13) disconnected
> I1219 13:33:51.945979 129146880 master.cpp:3364] Disconnecting agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 
> (172.18.8.13)
> I1219 13:33:51.946022 129146880 master.cpp:3383] Deactivating agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 
> (172.18.8.13)
> I1219 13:33:51.946081 131293184 hierarchical.cpp:766] Agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 deactivated
> @0x115f2761d  
> mesos::internal::ResourceProviderManagerProcess::subscribe()::$_2::operator()()
> @0x115f2977d  
> _ZN5cpp176invokeIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS2_14HttpConnectionERKNS1_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSG_DpOSH_
> @0x115f29740  
> _ZN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7Nothing13invoke_expandISC_NSt3__15tupleIJSG_EEENSK_IJEEEJLm0DTclsr5cpp17E6invokeclsr3stdE7forwardIT_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardIT0_Efp0_EEclsr3stdE7forwardIT1_Efp2_OSN_OSO_N5cpp1416integer_sequenceImJXspT2_OSP_
> @0x115f296bb  
> _ZNO6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingclIJEEEDTcl13invoke_expandclL_ZNSt3__14moveIRSC_EEONSJ_16remove_referenceIT_E4typeEOSN_EdtdefpT1fEclL_ZNSK_IRNSJ_5tupleIJSG_ESQ_SR_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_DpOSY_
> @0x115f2965d  
> _ZN5cpp176invokeIN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS5_14HttpConnectionERKNS4_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEJEEEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSK_DpOSL_
> @0x115f29631  
> 

[jira] [Created] (MESOS-8346) Resubscription of a resource provider will crash the agent if its HTTP connection isn't closed

2017-12-19 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8346:
---

 Summary: Resubscription of a resource provider will crash the 
agent if its HTTP connection isn't closed
 Key: MESOS-8346
 URL: https://issues.apache.org/jira/browse/MESOS-8346
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Jan Schlicht
Assignee: Jan Schlicht
Priority: Blocker


A resource provider might resubscribe while its old HTTP connection wasn't 
properly closed. In that case an agent will crash with, e.g., the following 
log:
{noformat}
I1219 13:33:51.937295 128610304 manager.cpp:570] Subscribing resource provider 
{"id":{"value":"8e71beef-796e-4bde-9257-952ed0f230a5"},"name":"test","type":"org.apache.mesos.rp.test"}
I1219 13:33:51.937443 128610304 manager.cpp:134] Terminating resource provider 
8e71beef-796e-4bde-9257-952ed0f230a5
I1219 13:33:51.937760 128610304 manager.cpp:134] Terminating resource provider 
8e71beef-796e-4bde-9257-952ed0f230a5
E1219 13:33:51.937851 129683456 http_connection.hpp:445] End-Of-File received
I1219 13:33:51.937865 131293184 slave.cpp:7105] Handling resource provider 
message 'DISCONNECT: resource provider 8e71beef-796e-4bde-9257-952ed0f230a5'
I1219 13:33:51.937968 131293184 slave.cpp:7347] Forwarding new total resources 
cpus:2; mem:1024; disk:1024; ports:[31000-32000]
F1219 13:33:51.938052 132366336 manager.cpp:606] Check failed: 
resourceProviders.subscribed.contains(resourceProviderId) 
*** Check failure stack trace: ***
E1219 13:33:51.938583 130756608 http_connection.hpp:445] End-Of-File received
I1219 13:33:51.938987 129683456 hierarchical.cpp:669] Agent 
0019c3fa-28c5-43a9-88d0-709eee271c62-S0 (172.18.8.13) updated with total 
resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
@0x1125380ef  google::LogMessageFatal::~LogMessageFatal()
@0x112534ae9  google::LogMessageFatal::~LogMessageFatal()
I1219 13:33:51.939131 129683456 hierarchical.cpp:1517] Performed allocation for 
1 agents in 61830ns
I1219 13:33:51.945793 2646795072 slave.cpp:927] Agent terminating
I1219 13:33:51.945955 129146880 master.cpp:1305] Agent 
0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 
(172.18.8.13) disconnected
I1219 13:33:51.945979 129146880 master.cpp:3364] Disconnecting agent 
0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 
(172.18.8.13)
I1219 13:33:51.946022 129146880 master.cpp:3383] Deactivating agent 
0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 
(172.18.8.13)
I1219 13:33:51.946081 131293184 hierarchical.cpp:766] Agent 
0019c3fa-28c5-43a9-88d0-709eee271c62-S0 deactivated
@0x115f2761d  
mesos::internal::ResourceProviderManagerProcess::subscribe()::$_2::operator()()
@0x115f2977d  
_ZN5cpp176invokeIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS2_14HttpConnectionERKNS1_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSG_DpOSH_
@0x115f29740  
_ZN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7Nothing13invoke_expandISC_NSt3__15tupleIJSG_EEENSK_IJEEEJLm0DTclsr5cpp17E6invokeclsr3stdE7forwardIT_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardIT0_Efp0_EEclsr3stdE7forwardIT1_Efp2_OSN_OSO_N5cpp1416integer_sequenceImJXspT2_OSP_
@0x115f296bb  
_ZNO6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingclIJEEEDTcl13invoke_expandclL_ZNSt3__14moveIRSC_EEONSJ_16remove_referenceIT_E4typeEOSN_EdtdefpT1fEclL_ZNSK_IRNSJ_5tupleIJSG_ESQ_SR_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_DpOSY_
@0x115f2965d  
_ZN5cpp176invokeIN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS5_14HttpConnectionERKNS4_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEJEEEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSK_DpOSL_
@0x115f29631  
_ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS6_14HttpConnectionERKNS5_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEJEEEvOT_DpOT0_
@0x115f29526  
_ZNO6lambda12CallableOnceIFvvEE10CallableFnINS_8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS7_14HttpConnectionERKNS6_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEEclEv
@0x10b6ca690  _ZNO6lambda12CallableOnceIFvvEEclEv
@0x10be09295  

[jira] [Created] (MESOS-8315) ResourceProviderManagerHttpApiTest.ResubscribeResourceProvider is flaky

2017-12-08 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8315:
---

 Summary: 
ResourceProviderManagerHttpApiTest.ResubscribeResourceProvider is flaky
 Key: MESOS-8315
 URL: https://issues.apache.org/jira/browse/MESOS-8315
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Jan Schlicht
Assignee: Jan Schlicht


Log from a CI run that failed:
{noformat}
[ RUN  ] 
ContentType/ResourceProviderManagerHttpApiTest.ResubscribeResourceProvider/1
I1208 02:27:51.541087  4488 cluster.cpp:172] Creating default 'local' authorizer
I1208 02:27:51.542224 24578 master.cpp:456] Master 
d29f2eb9-c698-47cb-aea5-56350dd07581 (ip-172-16-10-30.ec2.internal) started on 
172.16.10.30:47245
I1208 02:27:51.542243 24578 master.cpp:458] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/i4FLJ1/credentials" 
--filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/i4FLJ1/master" 
--zk_session_timeout="10secs"
I1208 02:27:51.542359 24578 master.cpp:507] Master only allowing authenticated 
frameworks to register
I1208 02:27:51.542366 24578 master.cpp:513] Master only allowing authenticated 
agents to register
I1208 02:27:51.542371 24578 master.cpp:519] Master only allowing authenticated 
HTTP frameworks to register
I1208 02:27:51.542376 24578 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/i4FLJ1/credentials'
I1208 02:27:51.542466 24578 master.cpp:563] Using default 'crammd5' 
authenticator
I1208 02:27:51.542503 24578 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I1208 02:27:51.542539 24578 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I1208 02:27:51.542564 24578 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I1208 02:27:51.542593 24578 master.cpp:642] Authorization enabled
I1208 02:27:51.542634 24577 hierarchical.cpp:175] Initialized hierarchical 
allocator process
I1208 02:27:51.542667 24577 whitelist_watcher.cpp:77] No whitelist given
I1208 02:27:51.543349 24571 master.cpp:2214] Elected as the leading master!
I1208 02:27:51.543365 24571 master.cpp:1694] Recovering from registrar
I1208 02:27:51.543426 24576 registrar.cpp:347] Recovering registrar
I1208 02:27:51.543519 24576 registrar.cpp:391] Successfully fetched the 
registry (0B) in 0ns
I1208 02:27:51.543546 24576 registrar.cpp:495] Applied 1 operations in 7697ns; 
attempting to update the registry
I1208 02:27:51.543674 24574 registrar.cpp:552] Successfully updated the 
registry in 0ns
I1208 02:27:51.543707 24574 registrar.cpp:424] Successfully recovered registrar
I1208 02:27:51.543820 24571 master.cpp:1807] Recovered 0 agents from the 
registry (172B); allowing 10mins for agents to re-register
I1208 02:27:51.543840 24577 hierarchical.cpp:213] Skipping recovery of 
hierarchical allocator: nothing to recover
W1208 02:27:51.545620  4488 process.cpp:2756] Attempted to spawn already 
running process files@172.16.10.30:47245
I1208 02:27:51.545984  4488 containerizer.cpp:304] Using isolation { 
environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
I1208 02:27:51.549041  4488 linux_launcher.cpp:146] Using /cgroup/freezer as 
the freezer hierarchy for the Linux launcher
I1208 02:27:51.549407  4488 provisioner.cpp:299] Using default backend 'copy'
I1208 02:27:51.549849  4488 cluster.cpp:460] Creating default 'local' authorizer
I1208 02:27:51.550534 24574 slave.cpp:258] Mesos agent started on 
(1222)@172.16.10.30:47245
I1208 02:27:51.550555 24574 slave.cpp:259] Flags at startup: --acls="" 
--agent_features="capabilities {
  type: MULTI_ROLE
}
capabilities {
  type: HIERARCHICAL_ROLE
}
capabilities {
  type: RESERVATION_REFINEMENT
}
capabilities {
  type: RESOURCE_PROVIDER
}
" 

[jira] [Created] (MESOS-8314) Add authorization to the `GET_RESOURCE_PROVIDER` v1 API call.

2017-12-08 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8314:
---

 Summary: Add authorization to the `GET_RESOURCE_PROVIDER` v1 API 
call.
 Key: MESOS-8314
 URL: https://issues.apache.org/jira/browse/MESOS-8314
 Project: Mesos
  Issue Type: Task
  Components: HTTP API
Reporter: Jan Schlicht


The {{GET_RESOURCE_PROVIDERS}} call is used to list all resource providers 
known to a Mesos master or agent. This call needs to be authorized.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8309) Introduce a UUID message type

2017-12-07 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8309:
---

 Summary: Introduce a UUID message type
 Key: MESOS-8309
 URL: https://issues.apache.org/jira/browse/MESOS-8309
 Project: Mesos
  Issue Type: Task
Reporter: Jan Schlicht
Assignee: Jan Schlicht
 Fix For: 1.5.0


Currently, when a UUID needs to be part of a protobuf message, we use a 
byte-array field for that. This has some drawbacks, especially when it comes to 
outputting the UUID in logs: to stringify the UUID field, we first have to 
create a stout UUID and then call its {{.toString()}}. It would help to have a 
UUID type in {{mesos.proto}} and provide a stringification function for it in 
{{type_utils.hpp}}.
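
A sketch of what such a stringification helper could look like; the {{UUID}} 
message shape (a single {{bytes}} field) and the hex formatting are assumptions, 
not the final {{mesos.proto}}/{{type_utils.hpp}} design:
{code}
#include <iomanip>
#include <ostream>
#include <sstream>
#include <string>

// Stand-in for the assumed protobuf message, e.g.
// `message UUID { required bytes value = 1; }`.
struct UUID
{
  std::string value;
};

// Hex-encodes the raw bytes so a UUID can be logged without first converting
// it into a stout UUID.
inline std::ostream& operator<<(std::ostream& stream, const UUID& uuid)
{
  std::ostringstream hex;
  hex << std::hex << std::setfill('0');

  for (unsigned char c : uuid.value) {
    hex << std::setw(2) << static_cast<int>(c);  // Two hex digits per byte.
  }

  return stream << hex.str();
}
{code}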



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8289) ReservationTest.MasterFailover is flaky when run with `RESOURCE_PROVIDER` capability

2017-12-01 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8289:
---

 Summary: ReservationTest.MasterFailover is flaky when run with 
`RESOURCE_PROVIDER` capability
 Key: MESOS-8289
 URL: https://issues.apache.org/jira/browse/MESOS-8289
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Jan Schlicht
Assignee: Jan Schlicht
 Fix For: 1.5.0


On a system under load, 
{{ResourceProviderCapability/ReservationTest.MasterFailover/1}} can fail. 
{{GLOG_v=2}} output of the failure:
{noformat}
[ RUN  ] ResourceProviderCapability/ReservationTest.MasterFailover/1
I1201 14:52:47.324741 122806272 process.cpp:2730] Dropping event for process 
hierarchical-allocator(34)@172.18.8.37:57116
I1201 14:52:47.324816 122806272 process.cpp:2730] Dropping event for process 
slave(17)@172.18.8.37:57116
I1201 14:52:47.324859 2720961344 clock.cpp:331] Clock paused at 2017-12-01 
13:53:04.834857088+00:00
I1201 14:52:47.326314 2720961344 clock.cpp:435] Clock of 
files@172.18.8.37:57116 updated to 2017-12-01 13:53:04.834857088+00:00
I1201 14:52:47.326371 2720961344 clock.cpp:435] Clock of 
hierarchical-allocator(35)@172.18.8.37:57116 updated to 2017-12-01 
13:53:04.834857088+00:00
I1201 14:52:47.326539 2720961344 cluster.cpp:170] Creating default 'local' 
authorizer
I1201 14:52:47.326568 2720961344 clock.cpp:435] Clock of 
local-authorizer(52)@172.18.8.37:57116 updated to 2017-12-01 
13:53:04.834857088+00:00
I1201 14:52:47.326671 2720961344 clock.cpp:435] Clock of 
standalone-master-detector(52)@172.18.8.37:57116 updated to 2017-12-01 
13:53:04.834857088+00:00
I1201 14:52:47.326709 2720961344 clock.cpp:435] Clock of 
in-memory-storage(35)@172.18.8.37:57116 updated to 2017-12-01 
13:53:04.834857088+00:00
I1201 14:52:47.326884 2720961344 clock.cpp:435] Clock of 
registrar(35)@172.18.8.37:57116 updated to 2017-12-01 13:53:04.834857088+00:00
I1201 14:52:47.327579 2720961344 clock.cpp:435] Clock of 
master@172.18.8.37:57116 updated to 2017-12-01 13:53:04.834857088+00:00
I1201 14:52:47.330301 119050240 master.cpp:454] Master 
209387ca-a9c3-4717-9769-a59d9fe927f1 (172.18.8.37) started on 172.18.8.37:57116
I1201 14:52:47.330329 119050240 master.cpp:456] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="5ms" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" 
--credentials="/private/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/z44iHn/credentials"
 --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" --roles="role" 
--root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/private/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/z44iHn/master"
 --zk_session_timeout="10secs"
I1201 14:52:47.330628 119050240 master.cpp:505] Master only allowing 
authenticated frameworks to register
I1201 14:52:47.330638 119050240 master.cpp:511] Master only allowing 
authenticated agents to register
I1201 14:52:47.330644 119050240 master.cpp:517] Master only allowing 
authenticated HTTP frameworks to register
I1201 14:52:47.330652 119050240 credentials.hpp:37] Loading credentials for 
authentication from 
'/private/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/z44iHn/credentials'
I1201 14:52:47.330873 119050240 master.cpp:561] Using default 'crammd5' 
authenticator
I1201 14:52:47.330927 119050240 clock.cpp:435] Clock of 
crammd5-authenticator(35)@172.18.8.37:57116 updated to 2017-12-01 
13:53:04.834857088+00:00
I1201 14:52:47.330963 119050240 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I1201 14:52:47.330993 119050240 clock.cpp:435] Clock of 
__basic_authenticator__(137)@172.18.8.37:57116 updated to 2017-12-01 
13:53:04.834857088+00:00
I1201 14:52:47.331056 119050240 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I1201 14:52:47.331082 119050240 clock.cpp:435] Clock of 
__basic_authenticator__(138)@172.18.8.37:57116 updated to 2017-12-01 
13:53:04.834857088+00:00
I1201 

[jira] [Created] (MESOS-8270) Add an agent endpoint to list all active resource providers

2017-11-27 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8270:
---

 Summary: Add an agent endpoint to list all active resource 
providers
 Key: MESOS-8270
 URL: https://issues.apache.org/jira/browse/MESOS-8270
 Project: Mesos
  Issue Type: Task
  Components: agent
Reporter: Jan Schlicht
Assignee: Jan Schlicht


Operators and frameworks might need information about all resource providers 
currently running on an agent. An API endpoint should provide that information, 
including each resource provider's name and type.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8269) Support resource provider re-subscription in the resource provider manager

2017-11-27 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8269:
---

 Summary: Support resource provider re-subscription in the resource 
provider manager
 Key: MESOS-8269
 URL: https://issues.apache.org/jira/browse/MESOS-8269
 Project: Mesos
  Issue Type: Task
Reporter: Jan Schlicht
Assignee: Jan Schlicht


Resource providers may re-subscribe by sending a {{SUBSCRIBE}} call that 
includes a resource provider ID. Support for this has to be added to the 
resource provider manager. E.g., the manager should check whether a resource 
provider with that ID exists and switch to the updated HTTP connection.
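
An illustrative, self-contained sketch of that check (the types and the 
{{subscribe}} signature here are assumptions, not the actual resource provider 
manager code):
{code}
#include <string>
#include <unordered_map>

struct HttpConnection {};  // Stand-in for the manager's connection handle.

struct ResourceProvider
{
  std::string id;
  HttpConnection http;
};

// On SUBSCRIBE with a known resource provider ID, keep the stored provider
// state and only swap in the new HTTP connection; otherwise register the
// provider as a new subscription.
void subscribe(
    std::unordered_map<std::string, ResourceProvider>& subscribed,
    const std::string& resourceProviderId,
    const HttpConnection& connection)
{
  auto it = subscribed.find(resourceProviderId);

  if (it != subscribed.end()) {
    it->second.http = connection;  // Re-subscription: reuse existing state.
    return;
  }

  subscribed[resourceProviderId] =
    ResourceProvider{resourceProviderId, connection};
}
{code}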



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8263) ResourceProviderManagerHttpApiTest.ConvertResources is flaky

2017-11-23 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-8263:

  Sprint: Mesosphere Sprint 68
Story Points: 2
  Labels: mesosphere test  (was: test)

> ResourceProviderManagerHttpApiTest.ConvertResources is flaky
> 
>
> Key: MESOS-8263
> URL: https://issues.apache.org/jira/browse/MESOS-8263
> Project: Mesos
>  Issue Type: Bug
>  Components: flaky
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere, test
>
> From an ASF CI run:
> {noformat}
> 3: [   OK ] 
> ContentType/ResourceProviderManagerHttpApiTest.ConvertResources/0 (1048 ms)
> 3: [ RUN  ] 
> ContentType/ResourceProviderManagerHttpApiTest.ConvertResources/1
> 3: I1123 08:06:04.233137 20036 cluster.cpp:162] Creating default 'local' 
> authorizer
> 3: I1123 08:06:04.237293 20060 master.cpp:448] Master 
> 7c9d8e8c-3fb3-44c5-8505-488ada3e848e (dce3e4c418cb) started on 
> 172.17.0.2:35090
> 3: I1123 08:06:04.237325 20060 master.cpp:450] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/EpiTO7/credentials" 
> --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/EpiTO7/master" 
> --zk_session_timeout="10secs"
> 3: I1123 08:06:04.237727 20060 master.cpp:499] Master only allowing 
> authenticated frameworks to register
> 3: I1123 08:06:04.237743 20060 master.cpp:505] Master only allowing 
> authenticated agents to register
> 3: I1123 08:06:04.237753 20060 master.cpp:511] Master only allowing 
> authenticated HTTP frameworks to register
> 3: I1123 08:06:04.237764 20060 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/EpiTO7/credentials'
> 3: I1123 08:06:04.238149 20060 master.cpp:555] Using default 'crammd5' 
> authenticator
> 3: I1123 08:06:04.238358 20060 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> 3: I1123 08:06:04.238575 20060 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> 3: I1123 08:06:04.238764 20060 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> 3: I1123 08:06:04.238939 20060 master.cpp:634] Authorization enabled
> 3: I1123 08:06:04.239159 20043 whitelist_watcher.cpp:77] No whitelist given
> 3: I1123 08:06:04.239187 20045 hierarchical.cpp:173] Initialized hierarchical 
> allocator process
> 3: I1123 08:06:04.242822 20041 master.cpp:2215] Elected as the leading master!
> 3: I1123 08:06:04.242857 20041 master.cpp:1695] Recovering from registrar
> 3: I1123 08:06:04.243067 20052 registrar.cpp:347] Recovering registrar
> 3: I1123 08:06:04.243808 20052 registrar.cpp:391] Successfully fetched the 
> registry (0B) in 690944ns
> 3: I1123 08:06:04.243953 20052 registrar.cpp:495] Applied 1 operations in 
> 37370ns; attempting to update the registry
> 3: I1123 08:06:04.244638 20052 registrar.cpp:552] Successfully updated the 
> registry in 620032ns
> 3: I1123 08:06:04.244798 20052 registrar.cpp:424] Successfully recovered 
> registrar
> 3: I1123 08:06:04.245352 20058 hierarchical.cpp:211] Skipping recovery of 
> hierarchical allocator: nothing to recover
> 3: I1123 08:06:04.245358 20057 master.cpp:1808] Recovered 0 agents from the 
> registry (129B); allowing 10mins for agents to re-register
> 3: W1123 08:06:04.251852 20036 process.cpp:2756] Attempted to spawn already 
> running process files@172.17.0.2:35090
> 3: I1123 08:06:04.253250 20036 containerizer.cpp:301] Using isolation { 
> environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
> 3: W1123 08:06:04.253965 20036 backend.cpp:76] 

[jira] [Created] (MESOS-8263) ResourceProviderManagerHttpApiTest.ConvertResources is flaky

2017-11-23 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8263:
---

 Summary: ResourceProviderManagerHttpApiTest.ConvertResources is 
flaky
 Key: MESOS-8263
 URL: https://issues.apache.org/jira/browse/MESOS-8263
 Project: Mesos
  Issue Type: Bug
  Components: flaky
Reporter: Jan Schlicht
Assignee: Jan Schlicht


From an ASF CI run:

{noformat}
3: [   OK ] 
ContentType/ResourceProviderManagerHttpApiTest.ConvertResources/0 (1048 ms)
3: [ RUN  ] 
ContentType/ResourceProviderManagerHttpApiTest.ConvertResources/1
3: I1123 08:06:04.233137 20036 cluster.cpp:162] Creating default 'local' 
authorizer
3: I1123 08:06:04.237293 20060 master.cpp:448] Master 
7c9d8e8c-3fb3-44c5-8505-488ada3e848e (dce3e4c418cb) started on 172.17.0.2:35090
3: I1123 08:06:04.237325 20060 master.cpp:450] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/EpiTO7/credentials" 
--filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/EpiTO7/master" 
--zk_session_timeout="10secs"
3: I1123 08:06:04.237727 20060 master.cpp:499] Master only allowing 
authenticated frameworks to register
3: I1123 08:06:04.237743 20060 master.cpp:505] Master only allowing 
authenticated agents to register
3: I1123 08:06:04.237753 20060 master.cpp:511] Master only allowing 
authenticated HTTP frameworks to register
3: I1123 08:06:04.237764 20060 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/EpiTO7/credentials'
3: I1123 08:06:04.238149 20060 master.cpp:555] Using default 'crammd5' 
authenticator
3: I1123 08:06:04.238358 20060 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
3: I1123 08:06:04.238575 20060 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
3: I1123 08:06:04.238764 20060 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
3: I1123 08:06:04.238939 20060 master.cpp:634] Authorization enabled
3: I1123 08:06:04.239159 20043 whitelist_watcher.cpp:77] No whitelist given
3: I1123 08:06:04.239187 20045 hierarchical.cpp:173] Initialized hierarchical 
allocator process
3: I1123 08:06:04.242822 20041 master.cpp:2215] Elected as the leading master!
3: I1123 08:06:04.242857 20041 master.cpp:1695] Recovering from registrar
3: I1123 08:06:04.243067 20052 registrar.cpp:347] Recovering registrar
3: I1123 08:06:04.243808 20052 registrar.cpp:391] Successfully fetched the 
registry (0B) in 690944ns
3: I1123 08:06:04.243953 20052 registrar.cpp:495] Applied 1 operations in 
37370ns; attempting to update the registry
3: I1123 08:06:04.244638 20052 registrar.cpp:552] Successfully updated the 
registry in 620032ns
3: I1123 08:06:04.244798 20052 registrar.cpp:424] Successfully recovered 
registrar
3: I1123 08:06:04.245352 20058 hierarchical.cpp:211] Skipping recovery of 
hierarchical allocator: nothing to recover
3: I1123 08:06:04.245358 20057 master.cpp:1808] Recovered 0 agents from the 
registry (129B); allowing 10mins for agents to re-register
3: W1123 08:06:04.251852 20036 process.cpp:2756] Attempted to spawn already 
running process files@172.17.0.2:35090
3: I1123 08:06:04.253250 20036 containerizer.cpp:301] Using isolation { 
environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
3: W1123 08:06:04.253965 20036 backend.cpp:76] Failed to create 'aufs' backend: 
AufsBackend requires root privileges
3: W1123 08:06:04.254109 20036 backend.cpp:76] Failed to create 'bind' backend: 
BindBackend requires root privileges
3: I1123 08:06:04.254148 20036 provisioner.cpp:259] Using default backend 'copy'
3: I1123 08:06:04.256542 20036 cluster.cpp:448] Creating default 'local' 
authorizer
3: I1123 08:06:04.260066 20057 slave.cpp:262] Mesos agent started on 
(784)@172.17.0.2:35090
3: I1123 

[jira] [Comment Edited] (MESOS-8211) Handle agent local resources in offer operation handler

2017-11-14 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249372#comment-16249372
 ] 

Jan Schlicht edited comment on MESOS-8211 at 11/14/17 2:14 PM:
---

https://reviews.apache.org/r/63751/
https://reviews.apache.org/r/63797/


was (Author: nfnt):
https://reviews.apache.org/r/63751/

> Handle agent local resources in offer operation handler
> ---
>
> Key: MESOS-8211
> URL: https://issues.apache.org/jira/browse/MESOS-8211
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> The master will send {{ApplyOfferOperationMessage}} instead of 
> {{CheckpointResourcesMessage}} when an agent has the 'RESOURCE_PROVIDER' 
> capability set. The agent handler for the message needs to be updated to 
> support operations on agent resources.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8218) Support `RESERVE`/`CREATE` operations with resource providers

2017-11-14 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-8218:

Shepherd: Jie Yu

> Support `RESERVE`/`CREATE` operations with resource providers
> -
>
> Key: MESOS-8218
> URL: https://issues.apache.org/jira/browse/MESOS-8218
> Project: Mesos
>  Issue Type: Task
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> {{RESERVE}}/{{UNRESERVE}}/{{CREATE}}/{{DESTROY}} operations should work with 
> resource provider resources like they do with agent resources. I.e. they will 
> be speculatively applied and an offer operation will be sent to the 
> respective resource provider.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8218) Support `RESERVE`/`CREATE` operations with resource providers

2017-11-14 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht reassigned MESOS-8218:
---

Assignee: Jan Schlicht

> Support `RESERVE`/`CREATE` operations with resource providers
> -
>
> Key: MESOS-8218
> URL: https://issues.apache.org/jira/browse/MESOS-8218
> Project: Mesos
>  Issue Type: Task
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> {{RESERVE}}/{{UNRESERVE}}/{{CREATE}}/{{DESTROY}} operations should work with 
> resource provider resources like they do with agent resources. I.e. they will 
> be speculatively applied and an offer operation will be sent to the 
> respective resource provider.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8218) Support `RESERVE`/`CREATE` operations with resource providers

2017-11-14 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8218:
---

 Summary: Support `RESERVE`/`CREATE` operations with resource 
providers
 Key: MESOS-8218
 URL: https://issues.apache.org/jira/browse/MESOS-8218
 Project: Mesos
  Issue Type: Task
Reporter: Jan Schlicht


{{RESERVE}}/{{UNRESERVE}}/{{CREATE}}/{{DESTROY}} operations should work with 
resource provider resources like they do with agent resources. I.e. they will 
be speculatively applied and an offer operation will be sent to the respective 
resource provider.
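
Purely as an illustration of the "speculatively apply, then forward to the resource provider" idea described above, here is a minimal, self-contained C++ sketch; {{Operation}}, {{apply}} and {{sendToProvider}} are hypothetical names for this sketch, not Mesos code.

{code}
#include <functional>
#include <iostream>
#include <string>

// Hypothetical stand-in; not the actual Mesos protobuf message.
struct Operation
{
  std::string type;        // e.g. "RESERVE", "CREATE"
  std::string resources;   // Simplified resource description.
  std::string providerId;  // Empty for plain agent resources.
};

// Sketch: the operation is applied speculatively to the master's view of
// the resources; if it touches resource provider resources, an offer
// operation is additionally forwarded to that provider.
void apply(
    const Operation& op,
    const std::function<void(const std::string&, const Operation&)>& sendToProvider)
{
  std::cout << "Speculatively applying " << op.type
            << " on " << op.resources << std::endl;

  if (!op.providerId.empty()) {
    sendToProvider(op.providerId, op);
  }
}

int main()
{
  apply(
      {"CREATE", "disk(role)[id1:path1]:64", "provider-1"},
      [](const std::string& providerId, const Operation& op) {
        std::cout << "Forwarding " << op.type
                  << " to resource provider " << providerId << std::endl;
      });

  return 0;
}
{code}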



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8211) Handle agent local resources in offer operation handler

2017-11-13 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8211:
---

 Summary: Handle agent local resources in offer operation handler
 Key: MESOS-8211
 URL: https://issues.apache.org/jira/browse/MESOS-8211
 Project: Mesos
  Issue Type: Task
  Components: agent
Reporter: Jan Schlicht
Assignee: Jan Schlicht


The master will send {{ApplyOfferOperationMessage}} instead of 
{{CheckpointResourcesMessage}} when an agent has the 'RESOURCE_PROVIDER' 
capability set. The agent handler for the message needs to be updated to 
support operations on agent resources.
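
As a rough illustration only, not the actual Mesos implementation, a minimal, self-contained C++ sketch of such an agent-side handler; {{OfferOperation}} and {{applyOfferOperation}} are hypothetical stand-ins for the protobuf message and the agent method.

{code}
#include <iostream>
#include <string>

// Hypothetical stand-in for the payload of an ApplyOfferOperationMessage;
// not the real Mesos protobuf.
struct OfferOperation
{
  enum class Type { RESERVE, UNRESERVE, CREATE, DESTROY } type;
  std::string resources;  // Simplified resource description.
};

// Sketch of the agent-side handler: operations that only touch agent-local
// resources are applied directly, mirroring what the old
// CheckpointResourcesMessage path did.
void applyOfferOperation(const OfferOperation& operation)
{
  switch (operation.type) {
    case OfferOperation::Type::RESERVE:
    case OfferOperation::Type::UNRESERVE:
      std::cout << "Updating reservations: " << operation.resources << std::endl;
      break;
    case OfferOperation::Type::CREATE:
    case OfferOperation::Type::DESTROY:
      std::cout << "Updating persistent volumes: " << operation.resources << std::endl;
      break;
  }
  // A real handler would also checkpoint the agent's new total resources here.
}

int main()
{
  applyOfferOperation({OfferOperation::Type::CREATE, "disk(role)[id1:path1]:128"});
  return 0;
}
{code}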



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-7594) Implement 'apply' for resource provider related operations

2017-10-18 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16150341#comment-16150341
 ] 

Jan Schlicht edited comment on MESOS-7594 at 10/18/17 2:35 PM:
---

https://reviews.apache.org/r/63104/
https://reviews.apache.org/r/61810/
https://reviews.apache.org/r/61946/
https://reviews.apache.org/r/63105/
https://reviews.apache.org/r/61947/


was (Author: nfnt):
https://reviews.apache.org/r/61810/
https://reviews.apache.org/r/61946/
https://reviews.apache.org/r/61947/

> Implement 'apply' for resource provider related operations
> --
>
> Key: MESOS-7594
> URL: https://issues.apache.org/jira/browse/MESOS-7594
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere, storage
>
> Resource providers provide new offer operations ({{CREATE_BLOCK}}, 
> {{DESTROY_BLOCK}}, {{CREATE_VOLUME}}, {{DESTROY_VOLUME}}). These operations 
> can be applied by frameworks when they accept an offer. Handling of these 
> operations has to be added to the master's {{accept}} call. I.e. the 
> corresponding resource provider needs to be extracted from the offer's resources 
> and a {{resource_provider::Event::OPERATION}} has to be sent to the resource 
> provider. The resource provider will answer with a 
> {{resource_provider::Call::Update}} which needs to be handled as well.
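
A hedged, self-contained C++ sketch of the accept-path flow described above (extract the resource provider from the operation's resources, send it an OPERATION event, handle the later update); {{Master}}, {{Resource}} and {{OfferOperation}} here are simplified stand-ins, not the actual Mesos classes.

{code}
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-ins; the real code works on Mesos protobuf messages.
struct Resource
{
  std::string name;
  std::string providerId;  // Empty for agent-local resources.
};

struct OfferOperation
{
  std::string type;  // e.g. "CREATE_VOLUME", "DESTROY_BLOCK"
  std::vector<Resource> resources;
};

class Master
{
public:
  // Sketch of the accept path: extract the resource provider ID from the
  // operation's resources and send that provider an OPERATION event.
  void accept(const OfferOperation& operation)
  {
    for (const Resource& resource : operation.resources) {
      if (!resource.providerId.empty()) {
        sendOperationEvent(resource.providerId, operation);
        return;
      }
    }
    std::cout << "Operation only touches agent resources" << std::endl;
  }

  // Called when the resource provider answers with its UPDATE call.
  void updateOperationStatus(const std::string& providerId, const std::string& status)
  {
    std::cout << "Provider " << providerId << " reports: " << status << std::endl;
  }

private:
  void sendOperationEvent(const std::string& providerId, const OfferOperation& operation)
  {
    std::cout << "Sending OPERATION event for " << operation.type
              << " to resource provider " << providerId << std::endl;
  }
};

int main()
{
  Master master;
  master.accept({"CREATE_VOLUME", {{"disk", "provider-1"}}});
  master.updateOperationStatus("provider-1", "OPERATION_FINISHED");
  return 0;
}
{code}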



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8087) Add operation status update handler in Master.

2017-10-16 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-8087:

  Sprint: Mesosphere Sprint 65
Story Points: 5
  Labels: mesosphere  (was: )

> Add operation status update handler in Master.
> --
>
> Key: MESOS-8087
> URL: https://issues.apache.org/jira/browse/MESOS-8087
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> Please follow this doc for details.
> https://docs.google.com/document/d/1RrrLVATZUyaURpEOeGjgxA6ccshuLo94G678IbL-Yco/edit#
> This handler will process operation status updates from resource providers. 
> Depending on whether it's an old or a new operation, the logic is slightly 
> different.
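
A minimal sketch of what such a handler could look like, assuming, purely as an illustration, that the "old vs. new" distinction maps to operations without and with an operation ID; {{OperationStatusUpdate}} and the handler below are hypothetical stand-ins, not the actual Mesos code.

{code}
#include <iostream>
#include <string>

// Hypothetical stand-in for an operation status update; not the real
// Mesos message.
struct OperationStatusUpdate
{
  std::string operationId;  // Assumed empty for "old" operations.
  std::string status;
};

// Sketch: updates that carry an operation ID are tracked (and would be
// acknowledged), while updates without one are only applied to the
// master's bookkeeping.
void handleOperationStatusUpdate(const OperationStatusUpdate& update)
{
  if (update.operationId.empty()) {
    std::cout << "Legacy operation update: " << update.status << std::endl;
  } else {
    std::cout << "Tracked operation " << update.operationId
              << " is now " << update.status << std::endl;
    // A real handler would also send an acknowledgement here.
  }
}

int main()
{
  handleOperationStatusUpdate({"", "OPERATION_FINISHED"});
  handleOperationStatusUpdate({"op-42", "OPERATION_FAILED"});
  return 0;
}
{code}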



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8089) Add messages to publish resources on a resource provider

2017-10-16 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-8089:

Sprint: Mesosphere Sprint 65  (was: Mesosphere Sprint 66)

> Add messages to publish resources on a resource provider
> 
>
> Key: MESOS-8089
> URL: https://issues.apache.org/jira/browse/MESOS-8089
> Project: Mesos
>  Issue Type: Task
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> Before launching a task that uses resource provider resources, the resource 
> provider needs to be informed to "publish" these resources, as it may have to 
> take some necessary actions. For external resource providers, resources might 
> also have to be "unpublished" when a task is finished. The resource provider 
> needs to acknowledge these calls once it's ready.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8087) Add operation status update handler in Master.

2017-10-16 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht reassigned MESOS-8087:
---

Assignee: Jan Schlicht

> Add operation status update handler in Master.
> --
>
> Key: MESOS-8087
> URL: https://issues.apache.org/jira/browse/MESOS-8087
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jan Schlicht
>
> Please follow this doc for details.
> https://docs.google.com/document/d/1RrrLVATZUyaURpEOeGjgxA6ccshuLo94G678IbL-Yco/edit#
> This handler will process operation status updates from resource providers. 
> Depending on whether it's an old or a new operation, the logic is slightly 
> different.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8089) Add messages to publish resources on a resource provider

2017-10-13 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-8089:

  Sprint: Mesosphere Sprint 66
Story Points: 7

> Add messages to publish resources on a resource provider
> 
>
> Key: MESOS-8089
> URL: https://issues.apache.org/jira/browse/MESOS-8089
> Project: Mesos
>  Issue Type: Task
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> Before launching a task that uses resource provider resources, the resource 
> provider needs to be informed to "publish" these resources, as it may have to 
> take some necessary actions. For external resource providers, resources might 
> also have to be "unpublished" when a task is finished. The resource provider 
> needs to acknowledge these calls once it's ready.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8089) Add messages to publish resources on a resource provider

2017-10-13 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8089:
---

 Summary: Add messages to publish resources on a resource provider
 Key: MESOS-8089
 URL: https://issues.apache.org/jira/browse/MESOS-8089
 Project: Mesos
  Issue Type: Task
Reporter: Jan Schlicht
Assignee: Jan Schlicht


Before launching a task that uses resource provider resources, the resource 
provider needs to be informed to "publish" these resources, as it may have to 
take some necessary actions. For external resource providers, resources might 
also have to be "unpublished" when a task is finished. The resource provider 
needs to acknowledge these calls once it's ready.
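
To make the intended handshake concrete, a minimal C++ sketch under the assumption that publishing is a request/acknowledge exchange; {{PublishResourcesRequest}} and {{launchTask}} are illustrative names, not the messages this ticket would actually add.

{code}
#include <functional>
#include <iostream>
#include <string>

// Hypothetical request message; illustrative only.
struct PublishResourcesRequest
{
  std::string resourceProviderId;
  std::string resources;
};

// Sketch of the handshake: the agent asks the resource provider to publish
// the resources before a task launch and only proceeds once the provider
// has acknowledged.
void launchTask(
    const PublishResourcesRequest& request,
    const std::function<void(const PublishResourcesRequest&,
                             const std::function<void()>&)>& publish)
{
  publish(request, []() {
    std::cout << "Publish acknowledged; launching task" << std::endl;
  });
}

int main()
{
  launchTask(
      {"provider-1", "disk(role)[id1:path1]:64"},
      [](const PublishResourcesRequest& request,
         const std::function<void()>& acked) {
        std::cout << "Publishing " << request.resources
                  << " on " << request.resourceProviderId << std::endl;
        acked();  // The provider acks once the resources are ready.
      });

  return 0;
}
{code}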



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7995) libprocess tests breaking on macOS.

2017-09-21 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174656#comment-16174656
 ] 

Jan Schlicht commented on MESOS-7995:
-

Forgot to mention it: Mine's also an SSL build (--enable-libevent --enable-ssl), 
using libevent 2.0.22. Latest HEAD (c0293a6f7d457a595a3763662e3a9740db31859b).

> libprocess tests breaking on macOS.
> ---
>
> Key: MESOS-7995
> URL: https://issues.apache.org/jira/browse/MESOS-7995
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
>Affects Versions: 1.5.0
>Reporter: Till Toenshoff
>Priority: Blocker
>
> Many libprocess tests fail on macOS, some even abort.
> Examples:
> {noformat}
> [--] 8 tests from HTTPConnectionTest
> [ RUN  ] HTTPConnectionTest.GzipRequestBody
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:972: Failure
> Failed to wait 15secs for connect
> [  FAILED  ] HTTPConnectionTest.GzipRequestBody (15001 ms)
> [ RUN  ] HTTPConnectionTest.Serial
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1015: Failure
> (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down
> [  FAILED  ] HTTPConnectionTest.Serial (0 ms)
> [ RUN  ] HTTPConnectionTest.Pipeline
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1094: Failure
> (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down
> [  FAILED  ] HTTPConnectionTest.Pipeline (1 ms)
> [ RUN  ] HTTPConnectionTest.ClosingRequest
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1190: Failure
> (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down
> [  FAILED  ] HTTPConnectionTest.ClosingRequest (0 ms)
> [ RUN  ] HTTPConnectionTest.ClosingResponse
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1245: Failure
> (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down
> [  FAILED  ] HTTPConnectionTest.ClosingResponse (0 ms)
> [ RUN  ] HTTPConnectionTest.ReferenceCounting
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1306: Failure
> (*connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down
> [  FAILED  ] HTTPConnectionTest.ReferenceCounting (1 ms)
> [ RUN  ] HTTPConnectionTest.Equality
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1333: Failure
> (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down
> [  FAILED  ] HTTPConnectionTest.Equality (0 ms)
> [ RUN  ] HTTPConnectionTest.RequestStreaming
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1360: Failure
> (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down
> [  FAILED  ] HTTPConnectionTest.RequestStreaming (0 ms)
> [--] 8 tests from HTTPConnectionTest (15003 ms total)
> {noformat}
> {noformat}
> [--] 8 tests from HttpAuthenticationTest
> [ RUN  ] HttpAuthenticationTest.NoAuthenticator
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1792: Failure
> (response).failure(): Failed to connect to 192.168.178.20:51437: Host is down
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1786: Failure
> Actual function call count doesn't match EXPECT_CALL(*http.process, 
> authenticated(_, Option::none()))...
>  Expected: to be called once
>Actual: never called - unsatisfied and active
> [  FAILED  ] HttpAuthenticationTest.NoAuthenticator (1 ms)
> [ RUN  ] HttpAuthenticationTest.Unauthorized
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1816: Failure
> (response).failure(): Failed to connect to 192.168.178.20:51437: Host is down
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F0921 12:18:19.947710 2519827264 future.hpp:1151] Check failed: !isFailed() 
> Future::get() but state == FAILED: Failed to connect to 192.168.178.20:51437: 
> Host is down
> *** Check failure stack trace: ***
> *** Aborted at 1505989099 (unix time) try "date -d @1505989099" if you are 
> using GNU date ***
> PC: @ 0x7fff5cd45fce __pthread_kill
> *** SIGABRT (@0x7fff5cd45fce) received by PID 23916 (TID 0x7fff96318340) 
> stack trace: ***
> @ 0x7fff5ce76f5a _sigtramp
> @ 0x7fff5ac5e526 std::__1::locale::facet::__on_zero_shared()
> @ 0x7fff5cca232a abort
> @0x1077b9659 google::logging_fail()
> @0x1077b964a google::LogMessage::Fail()
> @0x1077b72fc google::LogMessage::SendToLog()
> @0x1077b8089 google::LogMessage::Flush()
> @0x1077c12e9 google::LogMessageFatal::~LogMessageFatal()
> @0x1077b9b35 google::LogMessageFatal::~LogMessageFatal()
> @0x106998ad1 process::Future<>::get()
> @0x1069d4d5b HttpAuthenticationTest_Unauthorized_Test::TestBody()
> @0x1070a828e 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
>  

[jira] [Commented] (MESOS-7995) libprocess tests breaking on macOS.

2017-09-21 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174636#comment-16174636
 ] 

Jan Schlicht commented on MESOS-7995:
-

Is there something specific that's different in your environment? I can't 
reproduce this on macOS 10.13 with Apple Clang 9.0.0; all libprocess tests pass.

> libprocess tests breaking on macOS.
> ---
>
> Key: MESOS-7995
> URL: https://issues.apache.org/jira/browse/MESOS-7995
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
>Affects Versions: 1.5.0
>Reporter: Till Toenshoff
>Priority: Blocker
>
> Many libprocess tests fail on macOS, some even abort.
> Examples:
> {noformat}
> [--] 8 tests from HTTPConnectionTest
> [ RUN  ] HTTPConnectionTest.GzipRequestBody
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:972: Failure
> Failed to wait 15secs for connect
> [  FAILED  ] HTTPConnectionTest.GzipRequestBody (15001 ms)
> [ RUN  ] HTTPConnectionTest.Serial
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1015: Failure
> (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down
> [  FAILED  ] HTTPConnectionTest.Serial (0 ms)
> [ RUN  ] HTTPConnectionTest.Pipeline
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1094: Failure
> (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down
> [  FAILED  ] HTTPConnectionTest.Pipeline (1 ms)
> [ RUN  ] HTTPConnectionTest.ClosingRequest
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1190: Failure
> (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down
> [  FAILED  ] HTTPConnectionTest.ClosingRequest (0 ms)
> [ RUN  ] HTTPConnectionTest.ClosingResponse
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1245: Failure
> (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down
> [  FAILED  ] HTTPConnectionTest.ClosingResponse (0 ms)
> [ RUN  ] HTTPConnectionTest.ReferenceCounting
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1306: Failure
> (*connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down
> [  FAILED  ] HTTPConnectionTest.ReferenceCounting (1 ms)
> [ RUN  ] HTTPConnectionTest.Equality
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1333: Failure
> (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down
> [  FAILED  ] HTTPConnectionTest.Equality (0 ms)
> [ RUN  ] HTTPConnectionTest.RequestStreaming
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1360: Failure
> (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down
> [  FAILED  ] HTTPConnectionTest.RequestStreaming (0 ms)
> [--] 8 tests from HTTPConnectionTest (15003 ms total)
> {noformat}
> {noformat}
> [--] 8 tests from HttpAuthenticationTest
> [ RUN  ] HttpAuthenticationTest.NoAuthenticator
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1792: Failure
> (response).failure(): Failed to connect to 192.168.178.20:51437: Host is down
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1786: Failure
> Actual function call count doesn't match EXPECT_CALL(*http.process, 
> authenticated(_, Option::none()))...
>  Expected: to be called once
>Actual: never called - unsatisfied and active
> [  FAILED  ] HttpAuthenticationTest.NoAuthenticator (1 ms)
> [ RUN  ] HttpAuthenticationTest.Unauthorized
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1816: Failure
> (response).failure(): Failed to connect to 192.168.178.20:51437: Host is down
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F0921 12:18:19.947710 2519827264 future.hpp:1151] Check failed: !isFailed() 
> Future::get() but state == FAILED: Failed to connect to 192.168.178.20:51437: 
> Host is down
> *** Check failure stack trace: ***
> *** Aborted at 1505989099 (unix time) try "date -d @1505989099" if you are 
> using GNU date ***
> PC: @ 0x7fff5cd45fce __pthread_kill
> *** SIGABRT (@0x7fff5cd45fce) received by PID 23916 (TID 0x7fff96318340) 
> stack trace: ***
> @ 0x7fff5ce76f5a _sigtramp
> @ 0x7fff5ac5e526 std::__1::locale::facet::__on_zero_shared()
> @ 0x7fff5cca232a abort
> @0x1077b9659 google::logging_fail()
> @0x1077b964a google::LogMessage::Fail()
> @0x1077b72fc google::LogMessage::SendToLog()
> @0x1077b8089 google::LogMessage::Flush()
> @0x1077c12e9 google::LogMessageFatal::~LogMessageFatal()
> @0x1077b9b35 google::LogMessageFatal::~LogMessageFatal()
> @0x106998ad1 process::Future<>::get()
> @0x1069d4d5b HttpAuthenticationTest_Unauthorized_Test::TestBody()
> @0x1070a828e 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @   

[jira] [Updated] (MESOS-7594) Implement 'apply' for resource provider related operations

2017-09-06 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-7594:

Story Points: 5

> Implement 'apply' for resource provider related operations
> --
>
> Key: MESOS-7594
> URL: https://issues.apache.org/jira/browse/MESOS-7594
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere, storage
>
> Resource providers provide new offer operations ({{CREATE_BLOCK}}, 
> {{DESTROY_BLOCK}}, {{CREATE_VOLUME}}, {{DESTROY_VOLUME}}). These operations 
> can be applied by frameworks when they accept an offer. Handling of these 
> operations has to be added to the master's {{accept}} call. I.e. the 
> corresponding resource provider needs to be extracted from the offer's resources 
> and a {{resource_provider::Event::OPERATION}} has to be sent to the resource 
> provider. The resource provider will answer with a 
> {{resource_provider::Call::Update}} which needs to be handled as well.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7594) Implement 'apply' for resource provider related operations

2017-09-01 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-7594:

Sprint: Mesosphere Sprint 57, Mesosphere Sprint 62  (was: Mesosphere Sprint 
57)

> Implement 'apply' for resource provider related operations
> --
>
> Key: MESOS-7594
> URL: https://issues.apache.org/jira/browse/MESOS-7594
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere, storage
>
> Resource providers provide new offer operations ({{CREATE_BLOCK}}, 
> {{DESTROY_BLOCK}}, {{CREATE_VOLUME}}, {{DESTROY_VOLUME}}). These operations 
> can be applied by frameworks when they accept an offer. Handling of these 
> operations has to be added to the master's {{accept}} call. I.e. the 
> corresponding resource provider needs to be extracted from the offer's resources 
> and a {{resource_provider::Event::OPERATION}} has to be sent to the resource 
> provider. The resource provider will answer with a 
> {{resource_provider::Call::Update}} which needs to be handled as well.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7816) Add HTTP connection handling to the resource provider driver

2017-07-20 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-7816:

Labels: mesosphere storage  (was: mesosphere)

> Add HTTP connection handling to the resource provider driver
> 
>
> Key: MESOS-7816
> URL: https://issues.apache.org/jira/browse/MESOS-7816
> Project: Mesos
>  Issue Type: Task
>  Components: storage
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere, storage
>
> The {{resource_provider::Driver}} is responsible for establishing a 
> connection with the agent/master resource provider API, sending calls to the 
> API, and receiving events from it. This is done using HTTP and should be 
> implemented similarly to how it's done for schedulers and executors (see 
> {{src/executor/executor.cpp}}, {{src/scheduler/scheduler.cpp}}).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7816) Add HTTP connection handling to the resource provider driver

2017-07-20 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-7816:
---

 Summary: Add HTTP connection handling to the resource provider 
driver
 Key: MESOS-7816
 URL: https://issues.apache.org/jira/browse/MESOS-7816
 Project: Mesos
  Issue Type: Task
  Components: storage
Reporter: Jan Schlicht
Assignee: Jan Schlicht


The {{resource_provider::Driver}} is responsible for establishing a connection 
with the agent/master resource provider API, sending calls to the API, and 
receiving events from it. This is done using HTTP and should be implemented 
similarly to how it's done for schedulers and executors (see 
{{src/executor/executor.cpp}}, {{src/scheduler/scheduler.cpp}}).
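
A hedged sketch of the driver surface this describes (connect, send calls, receive events); the class below is a simplified stand-in, not the real {{resource_provider::Driver}}, and the endpoint path in the comment is only illustrative.

{code}
#include <functional>
#include <iostream>
#include <string>
#include <utility>

// Simplified stand-in for the driver; the real one would keep a persistent
// libprocess HTTP connection, like the scheduler and executor libraries.
class Driver
{
public:
  Driver(std::function<void()> connected,
         std::function<void(const std::string&)> received)
    : connected_(std::move(connected)),
      received_(std::move(received)) {}

  // Establish the HTTP connection to the agent/master API.
  void start()
  {
    // A real implementation would open the connection here and retry
    // with backoff on failure.
    connected_();
  }

  // Send a call to the API; events come back through `received_`.
  void send(const std::string& call)
  {
    std::cout << "Sending call (e.g. via an HTTP POST): " << call << std::endl;
    received_("SUBSCRIBED");  // Simulated event for this sketch.
  }

private:
  std::function<void()> connected_;
  std::function<void(const std::string&)> received_;
};

int main()
{
  Driver driver(
      []() { std::cout << "Connected" << std::endl; },
      [](const std::string& event) { std::cout << "Event: " << event << std::endl; });

  driver.start();
  driver.send("SUBSCRIBE");
  return 0;
}
{code}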



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7780) Add `SUBSCRIBE` call handling to the resource provider manager

2017-07-19 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-7780:

Story Points: 5

> Add `SUBSCRIBE` call handling to the resource provider manager
> --
>
> Key: MESOS-7780
> URL: https://issues.apache.org/jira/browse/MESOS-7780
> Project: Mesos
>  Issue Type: Task
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: storage
>
> Resource providers will use the HTTP API to subscribe to the 
> {{ResourceProviderManager}}. Handling these calls needs to be implemented. On 
> subscription, a unique resource provider ID will be assigned to the resource 
> provider and a {{SUBSCRIBED}} event will be sent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7780) Add `SUBSCRIBE` call handling to the resource provider manager

2017-07-18 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-7780:

Sprint: Mesosphere Sprint 59

> Add `SUBSCRIBE` call handling to the resource provider manager
> --
>
> Key: MESOS-7780
> URL: https://issues.apache.org/jira/browse/MESOS-7780
> Project: Mesos
>  Issue Type: Task
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: storage
>
> Resource providers will use the HTTP API to subscribe to the 
> {{ResourceProviderManager}}. Handling these calls needs to be implemented. On 
> subscription, a unique resource provider ID will be assigned to the resource 
> provider and a {{SUBSCRIBED}} event will be sent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7780) Add `SUBSCRIBE` call handling to the resource provider manager

2017-07-18 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-7780:

Sprint:   (was: Mesosphere Sprint 59)

> Add `SUBSCRIBE` call handling to the resource provider manager
> --
>
> Key: MESOS-7780
> URL: https://issues.apache.org/jira/browse/MESOS-7780
> Project: Mesos
>  Issue Type: Task
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: storage
>
> Resource providers will use the HTTP API to subscribe to the 
> {{ResourceProviderManager}}. Handling these calls needs to be implemented. On 
> subscription, a unique resource provider ID will be assigned to the resource 
> provider and a {{SUBSCRIBED}} event will be sent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7780) Add `SUBSCRIBE` call handling to the resource provider manager

2017-07-11 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-7780:
---

 Summary: Add `SUBSCRIBE` call handling to the resource provider 
manager
 Key: MESOS-7780
 URL: https://issues.apache.org/jira/browse/MESOS-7780
 Project: Mesos
  Issue Type: Task
Reporter: Jan Schlicht
Assignee: Jan Schlicht


Resource providers will use the HTTP API to subscribe to the 
{{ResourceProviderManager}}. Handling these calls needs to be implemented. On 
subscription, a unique resource provider ID will be assigned to the resource 
provider and a {{SUBSCRIBED}} event will be sent.
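
A self-contained C++ sketch of the subscription flow described above; {{ResourceProviderManager}} here is a simplified stand-in for the real manager, and the ID scheme is only illustrative.

{code}
#include <iostream>
#include <map>
#include <string>

// Simplified stand-in, not the actual Mesos class.
class ResourceProviderManager
{
public:
  // Handles a SUBSCRIBE call: assigns a unique resource provider ID and
  // answers with a SUBSCRIBED event carrying that ID.
  std::string subscribe(const std::string& name)
  {
    const std::string id = "provider-" + std::to_string(nextId_++);
    subscribed_[id] = name;

    std::cout << "SUBSCRIBED " << name << " as " << id << std::endl;
    return id;
  }

private:
  int nextId_ = 0;
  std::map<std::string, std::string> subscribed_;
};

int main()
{
  ResourceProviderManager manager;
  manager.subscribe("org.apache.mesos.rp.example");  // Illustrative name.
  return 0;
}
{code}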



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7758) Stout doesn't build standalone.

2017-07-05 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074393#comment-16074393
 ] 

Jan Schlicht commented on MESOS-7758:
-

Libprocess is affected as well.
{noformat}
$ cd build/3rdparty/libprocess
$ make
...
make[1]: *** No rule to make target `googlemock-build-stamp'.  Stop.
make: *** [../googletest-release-1.8.0/googlemock-build-stamp] Error 2
{noformat}


> Stout doesn't build standalone.
> ---
>
> Key: MESOS-7758
> URL: https://issues.apache.org/jira/browse/MESOS-7758
> Project: Mesos
>  Issue Type: Bug
>  Components: build, stout
>Reporter: James Peach
>
> Stout doesn't build in a standalone configuration:
> {noformat}
> $ cd ~/src/mesos/3rdparty/stout
> $ ./bootstrap
> $ cd ~/build/stout
> $ ~/src/mesos/3rdparty/stout/configure
> ...
> $ make
> ...
> make[1]: Leaving directory '/home/vagrant/build/stout/3rdparty'
> make[1]: Entering directory '/home/vagrant/build/stout/3rdparty'
> make[1]: *** No rule to make target 'googlemock-build-stamp'.  Stop.
> make[1]: Leaving directory '/home/vagrant/build/stout/3rdparty'
> make: *** [Makefile:1902: 
> 3rdparty/googletest-release-1.8.0/googlemock-build-stamp] Error 2
> {noformat}
> Note that the build expects 
> {{3rdparty/googletest-release-1.8.0/googlemock-build-stamp}}, but 
> {{googletest}} hasn't been staged yet:
> {noformat}
> [vagrant@fedora-26 stout]$ ls -l 3rdparty/
> total 44
> drwxr-xr-x.  3 vagrant vagrant  4096 Jan 18  2016 boost-1.53.0
> -rw-rw-r--.  1 vagrant vagrant 0 Jul  5 06:16 boost-1.53.0-stamp
> drwxrwxr-x.  8 vagrant vagrant  4096 Aug 15  2016 elfio-3.2
> -rw-rw-r--.  1 vagrant vagrant 0 Jul  5 06:16 elfio-3.2-stamp
> drwxr-xr-x. 10 vagrant vagrant  4096 Jul  5 06:16 glog-0.3.3
> -rw-rw-r--.  1 vagrant vagrant 0 Jul  5 06:16 glog-0.3.3-build-stamp
> -rw-rw-r--.  1 vagrant vagrant 0 Jul  5 06:16 glog-0.3.3-stamp
> -rw-rw-r--.  1 vagrant vagrant   734 Jul  5 06:03 gmock_sources.cc
> -rw-rw-r--.  1 vagrant vagrant 25657 Jul  5 06:03 Makefile
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7696) Update resource provider design in the master

2017-06-20 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-7696:
---

 Summary: Update resource provider design in the master
 Key: MESOS-7696
 URL: https://issues.apache.org/jira/browse/MESOS-7696
 Project: Mesos
  Issue Type: Task
  Components: master
Reporter: Jan Schlicht
Assignee: Jan Schlicht


Some discussions around how to use the allocator resulted in changes to how local 
resource providers and external resource providers should be handled in the 
master. The current approach needs to be updated.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-7595) Implement local resource provider registration

2017-05-31 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht reassigned MESOS-7595:
---

Assignee: Jan Schlicht

> Implement local resource provider registration
> --
>
> Key: MESOS-7595
> URL: https://issues.apache.org/jira/browse/MESOS-7595
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> A {{resource_provider::Call::SUBSCRIBE}} call of a resource provider should 
> add it to the list of registered resource providers in the master.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7595) Implement local resource provider registration

2017-05-31 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-7595:
---

 Summary: Implement local resource provider registration
 Key: MESOS-7595
 URL: https://issues.apache.org/jira/browse/MESOS-7595
 Project: Mesos
  Issue Type: Task
  Components: master
Reporter: Jan Schlicht


A {{resource_provider::Call::SUBSCRIBE}} call of a resource provider should add 
it to the list of registered resource providers in the master.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7595) Implement local resource provider registration

2017-05-31 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-7595:

Shepherd: Jie Yu

> Implement local resource provider registration
> --
>
> Key: MESOS-7595
> URL: https://issues.apache.org/jira/browse/MESOS-7595
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> A {{resource_provider::Call::SUBSCRIBE}} call of a resource provider should 
> add it to the list of registered resource providers in the master.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7592) Add handling of local resource providers to the master

2017-05-31 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-7592:
---

 Summary: Add handling of local resource providers to the master
 Key: MESOS-7592
 URL: https://issues.apache.org/jira/browse/MESOS-7592
 Project: Mesos
  Issue Type: Task
  Components: master
Reporter: Jan Schlicht
Assignee: Jan Schlicht


To support local resource providers, the master has to keep track of the 
registered ones, their allocated resources, and their outstanding offers. This is 
similar to how it's already done for agents, so the existing functionality 
could be abstracted and reused for local resource providers.
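
A minimal sketch of the bookkeeping this asks for, with illustrative field names rather than the master's actual data structures.

{code}
#include <iostream>
#include <map>
#include <set>
#include <string>

// Illustrative bookkeeping entry; not the master's real struct.
struct ResourceProvider
{
  std::string info;                // Registered ResourceProviderInfo.
  std::string allocatedResources;  // Simplified resource accounting.
  std::set<std::string> offerIds;  // Outstanding offers for its resources.
};

int main()
{
  // The master would keep registered providers keyed by their ID,
  // analogously to how it tracks agents.
  std::map<std::string, ResourceProvider> resourceProviders;

  resourceProviders["provider-1"] =
      {"type: disk", "disk(role):128", {"offer-7"}};

  for (const auto& [id, provider] : resourceProviders) {
    std::cout << id << " has " << provider.offerIds.size()
              << " outstanding offer(s)" << std::endl;
  }

  return 0;
}
{code}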



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7558) Add resource provider validation

2017-05-24 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-7558:
---

 Summary: Add resource provider validation
 Key: MESOS-7558
 URL: https://issues.apache.org/jira/browse/MESOS-7558
 Project: Mesos
  Issue Type: Task
  Components: master
Reporter: Jan Schlicht


Similar to what's done during agent registration/re-registration, the 
information provided by a resource provider needs to be validated during 
certain operations (e.g. re-registration, while applying offer operations, ...).
Some of these validations only cover the provided information (e.g. are the 
resources in {{ResourceProviderInfo}} only of type {{disk}}), while others take 
the current cluster state into account (e.g. do the resources that a task wants 
to use exist on the resource provider).
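
A hedged sketch of the two kinds of checks described above; {{Resource}} and both helper functions are hypothetical stand-ins, not Mesos' validation code.

{code}
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical resource description; illustrative only.
struct Resource
{
  std::string type;  // e.g. "disk"
  std::string id;
};

// Validation that only looks at the provided information: a resource
// provider may only expose resources of type "disk".
bool validateResourceProviderInfo(const std::vector<Resource>& resources)
{
  return std::all_of(resources.begin(), resources.end(),
      [](const Resource& r) { return r.type == "disk"; });
}

// Validation against the current cluster state: every resource a task
// wants to use must exist on the resource provider.
bool validateTaskResources(
    const std::vector<Resource>& requested,
    const std::vector<Resource>& provided)
{
  return std::all_of(requested.begin(), requested.end(),
      [&provided](const Resource& r) {
        return std::any_of(provided.begin(), provided.end(),
            [&r](const Resource& p) { return p.id == r.id; });
      });
}

int main()
{
  const std::vector<Resource> provided = {{"disk", "volume-1"}};

  std::cout << std::boolalpha
            << validateResourceProviderInfo(provided) << " "
            << validateTaskResources({{"disk", "volume-2"}}, provided)
            << std::endl;
  return 0;
}
{code}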



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7557) Test that resource providers can re-register after a master failover

2017-05-24 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-7557:
---

 Summary: Test that resource providers can re-register after a 
master failover
 Key: MESOS-7557
 URL: https://issues.apache.org/jira/browse/MESOS-7557
 Project: Mesos
  Issue Type: Task
Reporter: Jan Schlicht


Restarting a master in a test environment should trigger a resource provider 
re-registration.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7556) Wait for resource provider re-registrations after a master failover

2017-05-24 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-7556:
---

 Summary: Wait for resource provider re-registrations after a 
master failover
 Key: MESOS-7556
 URL: https://issues.apache.org/jira/browse/MESOS-7556
 Project: Mesos
  Issue Type: Task
  Components: master
Reporter: Jan Schlicht


Recover all resource provider IDs from registrar after a failover and set up 
timeouts for resource providers to re-register.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

