[jira] [Commented] (MESOS-9969) Agent crashes when trying to clean up volume
[ https://issues.apache.org/jira/browse/MESOS-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932173#comment-16932173 ] Jan Schlicht commented on MESOS-9969: - This looks like MESOS-9966. > Agent crashes when trying to clean up volue > --- > > Key: MESOS-9969 > URL: https://issues.apache.org/jira/browse/MESOS-9969 > Project: Mesos > Issue Type: Bug > Components: agent >Affects Versions: 1.8.2 >Reporter: Tomas Barton >Priority: Major > > {code} > Sep 17 13:49:26 w03 mesos-agent[21803]: I0917 13:49:26.081748 21828 > linux_launcher.cpp:650] Destroying cgroup > '/sys/fs/cgroup/systemd/mesos/370ed262-4041-4180-a7e1-9ea78070e3a6' > Sep 17 13:49:26 w03 mesos-agent[21803]: I0917 13:49:26.081876 21832 > containerizer.cpp:2907] Checkpointing termination state to nested container's > runtime directory > '/var/run/mesos/containers/8e3997e7-c53a-4043-9a7e-26a2e436a041/containers/ae0bdc6d-c738-4352-b5d4-7572182671d5/termination' > Sep 17 13:49:26 w03 mesos-agent[21803]: mesos-agent: > /pkg/src/mesos/3rdparty/stout/include/stout/option.hpp:120: T& > Option::get() & [with T = std::basic_string]: Assertion `isSome()' > failed. 
> Sep 17 13:49:26 w03 mesos-agent[21803]: *** Aborted at 1568728166 (unix time) > try "date -d @1568728166" if you are using GNU date *** > Sep 17 13:49:26 w03 mesos-agent[21803]: W0917 13:49:26.082281 21835 > disk.cpp:453] Ignoring cleanup for unknown container > a9ba6959-ea02-4543-b7d5-92a63940 > Sep 17 13:49:26 w03 mesos-agent[21803]: PC: @ 0x7f16a3867fff (unknown) > Sep 17 13:49:26 w03 mesos-agent[21803]: *** SIGABRT (@0x552b) received by PID > 21803 (TID 0x7f169e47d700) from PID 21803; stack trace: *** > Sep 17 13:49:26 w03 mesos-agent[21803]: E0917 13:49:26.082608 21835 > memory.cpp:501] Listening on OOM events failed for container > a9ba6959-ea02-4543-b7d5-92a63940: Event listener is terminating > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3be50e0 (unknown) > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3867fff (unknown) > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a386942a (unknown) > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3860e67 (unknown) > Sep 17 13:49:26 w03 mesos-agent[21803]: I0917 13:49:26.083741 21835 > linux.cpp:1074] Unmounting volume > '/var/lib/mesos/slave/slaves/04e596b7-f03d-4cba-bbbc-fa9e0aebb5d2-S17/frameworks/04e596b7-f03d-4cba-bbbc-fa9e0aebb5d2-0003/executors/es01__coordinator__8591ac8e-3d9d-45ac-bb68-bee379c8c4a4/runs/a9ba6959-ea02-4543-b7d5-92a63940/container-path' > for con > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3860f12 (unknown) > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a7654f13 > _ZNR6OptionISsE3getEv.part.152 > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a7666b2f > mesos::internal::slave::MesosContainerizerProcess::__destroy() > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a861cb41 > process::ProcessBase::consume() > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a8633c9c > process::ProcessManager::resume() > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a86398a6 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > Sep 17 13:49:26 
w03 mesos-agent[21803]: @ 0x7f16a43c6200 (unknown) > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3bdb4a4 start_thread > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a391dd0f (unknown) > Sep 17 13:49:26 w03 systemd[1]: dcos-mesos-slave.service: Main process > exited, code=killed, status=6/ABRT > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
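The assertion in the stack trace above ({{Option::get() ... Assertion `isSome()' failed}}) is stout's {{Option<T>}} aborting on unchecked access to an empty value. A minimal, simplified sketch of that behavior (hypothetical code, not the actual stout implementation in option.hpp):

```cpp
#include <cassert>
#include <string>

// Simplified stand-in for stout's Option<T>. get() asserts isSome(),
// which is exactly the "Assertion `isSome()' failed." seen in the log.
template <typename T>
class Option {
public:
  static Option<T> none() { return Option<T>(); }
  static Option<T> some(const T& t) { return Option<T>(t); }

  bool isSome() const { return set; }
  bool isNone() const { return !set; }

  // Calling get() on a NONE value aborts the process (SIGABRT),
  // matching the agent crash above.
  const T& get() const {
    assert(isSome());
    return value;
  }

  // The safe alternative: supply a fallback instead of asserting.
  T getOrElse(const T& fallback) const { return set ? value : fallback; }

private:
  Option() : set(false), value() {}
  explicit Option(const T& t) : set(true), value(t) {}

  bool set;
  T value;
};
```

The caller-side fix is to check {{isSome()}} (or use {{getOrElse()}}) before dereferencing, rather than assuming the value is present.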
[jira] [Commented] (MESOS-9966) Agent crashes when trying to destroy orphaned nested container if root container is orphaned as well
[ https://issues.apache.org/jira/browse/MESOS-9966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932152#comment-16932152 ] Jan Schlicht commented on MESOS-9966: - You're right, the flag is enabled.
[jira] [Commented] (MESOS-9966) Agent crashes when trying to destroy orphaned nested container if root container is orphaned as well
[ https://issues.apache.org/jira/browse/MESOS-9966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932150#comment-16932150 ] Jan Schlicht commented on MESOS-9966: - According to the stack trace we are hitting the code. Let me double-check if {{gc_non_executor_container_sandboxes}} is enabled.
[jira] [Commented] (MESOS-9966) Agent crashes when trying to destroy orphaned nested container if root container is orphaned as well
[ https://issues.apache.org/jira/browse/MESOS-9966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931187#comment-16931187 ] Jan Schlicht commented on MESOS-9966: - The flag wasn't set so it's at its default value which is {{false}}.
[jira] [Created] (MESOS-9968) WWW-Authenticate header parsing fails when commas are in (quoted) realm
Jan Schlicht created MESOS-9968: --- Summary: WWW-Authenticate header parsing fails when commas are in (quoted) realm Key: MESOS-9968 URL: https://issues.apache.org/jira/browse/MESOS-9968 Project: Mesos Issue Type: Bug Components: HTTP API, libprocess Reporter: Jan Schlicht This was discovered when trying to launch the {{[nvcr.io/nvidia/tensorflow:19.08-py3|http://nvcr.io/nvidia/tensorflow:19.08-py3]}} image using the Mesos containerizer. The launch fails with {noformat} Failed to launch container: Failed to get WWW-Authenticate header: Unexpected auth-param format: 'realm="https://nvcr.io/proxy_auth?scope=repository:nvidia/tensorflow:pull' in 'realm="https://nvcr.io/proxy_auth?scope=repository:nvidia/tensorflow:pull,push";' {noformat} This is because the [header tokenization in libprocess|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/http.cpp#L640] can't handle commas in quoted realm values. -- This message was sent by Atlassian Jira (v8.3.2#803003)
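The failing tokenization splits auth-params on every comma, including commas that sit inside the quoted realm. A quote-aware splitter along the following lines parses the header from the error message correctly (a sketch only, not the libprocess code; {{splitAuthParams}} is a hypothetical name):

```cpp
#include <string>
#include <vector>

// Hypothetical quote-aware auth-param splitter: a comma only
// terminates a param when we are outside a double-quoted string,
// so 'realm="...pull,push"' stays a single param.
std::vector<std::string> splitAuthParams(const std::string& header) {
  std::vector<std::string> params;
  std::string current;
  bool quoted = false;

  for (char c : header) {
    if (c == '"') {
      quoted = !quoted;   // toggle quoted-string state
      current += c;
    } else if (c == ',' && !quoted) {
      params.push_back(current);  // unquoted comma ends the param
      current.clear();
    } else {
      current += c;
    }
  }

  if (!current.empty()) {
    params.push_back(current);
  }
  return params;
}
```

With the naive split-on-comma approach, the realm above would be cut at {{pull}}, producing exactly the "Unexpected auth-param format" failure in the report.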
[jira] [Created] (MESOS-9966) Agent crashes when trying to destroy orphaned nested container if root container is orphaned as well
Jan Schlicht created MESOS-9966: --- Summary: Agent crashes when trying to destroy orphaned nested container if root container is orphaned as well Key: MESOS-9966 URL: https://issues.apache.org/jira/browse/MESOS-9966 Project: Mesos Issue Type: Bug Components: containerization Affects Versions: 1.7.3 Reporter: Jan Schlicht Noticed an agent crash-looping when trying to recover. It recognized a container and its nested container as orphaned. When trying to destroy the nested container, the agent crashes. Probably when trying to [get the sandbox path of the root container|https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2966]. {noformat} 2019-09-09 05:04:26: I0909 05:04:26.382326 89950 linux_launcher.cpp:286] Recovering Linux launcher 2019-09-09 05:04:26: I0909 05:04:26.383162 89950 linux_launcher.cpp:331] Not recovering cgroup mesos/a127917b-96fe-4100-b73d-5f876ce9ffc1/mesos 2019-09-09 05:04:26: I0909 05:04:26.383199 89950 linux_launcher.cpp:343] Recovered container a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97 2019-09-09 05:04:26: I0909 05:04:26.383216 89950 linux_launcher.cpp:331] Not recovering cgroup mesos/a127917b-96fe-4100-b73d-5f876ce9ffc1/mesos/9783e2bb-7c2e-4930-9d39-4225bb6f1b97/mesos 2019-09-09 05:04:26: I0909 05:04:26.383229 89950 linux_launcher.cpp:343] Recovered container 2ee154e2-3cc4-420a-99fb-065e740f3091 2019-09-09 05:04:26: I0909 05:04:26.383237 89950 linux_launcher.cpp:343] Recovered container a127917b-96fe-4100-b73d-5f876ce9ffc1 2019-09-09 05:04:26: I0909 05:04:26.383249 89950 linux_launcher.cpp:343] Recovered container 2ee154e2-3cc4-420a-99fb-065e740f3091.49fe2bf9-17af-415f-92b6-92a4db619436 2019-09-09 05:04:26: I0909 05:04:26.383260 89950 linux_launcher.cpp:331] Not recovering cgroup mesos/2ee154e2-3cc4-420a-99fb-065e740f3091/mesos 2019-09-09 05:04:26: I0909 05:04:26.383271 89950 linux_launcher.cpp:331] Not recovering cgroup 
mesos/2ee154e2-3cc4-420a-99fb-065e740f3091/mesos/49fe2bf9-17af-415f-92b6-92a4db619436/mesos 2019-09-09 05:04:26: I0909 05:04:26.383280 89950 linux_launcher.cpp:437] 2ee154e2-3cc4-420a-99fb-065e740f3091.49fe2bf9-17af-415f-92b6-92a4db619436 is a known orphaned container 2019-09-09 05:04:26: I0909 05:04:26.383289 89950 linux_launcher.cpp:437] a127917b-96fe-4100-b73d-5f876ce9ffc1 is a known orphaned container 2019-09-09 05:04:26: I0909 05:04:26.383296 89950 linux_launcher.cpp:437] 2ee154e2-3cc4-420a-99fb-065e740f3091 is a known orphaned container 2019-09-09 05:04:26: I0909 05:04:26.383304 89950 linux_launcher.cpp:437] a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97 is a known orphaned container 2019-09-09 05:04:26: I0909 05:04:26.383414 89950 containerizer.cpp:1092] Recovering isolators 2019-09-09 05:04:26: I0909 05:04:26.385931 89977 memory.cpp:478] Started listening for OOM events for container a127917b-96fe-4100-b73d-5f876ce9ffc1 2019-09-09 05:04:26: I0909 05:04:26.386118 89977 memory.cpp:590] Started listening on 'low' memory pressure events for container a127917b-96fe-4100-b73d-5f876ce9ffc1 2019-09-09 05:04:26: I0909 05:04:26.386152 89977 memory.cpp:590] Started listening on 'medium' memory pressure events for container a127917b-96fe-4100-b73d-5f876ce9ffc1 2019-09-09 05:04:26: I0909 05:04:26.386175 89977 memory.cpp:590] Started listening on 'critical' memory pressure events for container a127917b-96fe-4100-b73d-5f876ce9ffc1 2019-09-09 05:04:26: I0909 05:04:26.386227 89977 memory.cpp:478] Started listening for OOM events for container 2ee154e2-3cc4-420a-99fb-065e740f3091 2019-09-09 05:04:26: I0909 05:04:26.386248 89977 memory.cpp:590] Started listening on 'low' memory pressure events for container 2ee154e2-3cc4-420a-99fb-065e740f3091 2019-09-09 05:04:26: I0909 05:04:26.386270 89977 memory.cpp:590] Started listening on 'medium' memory pressure events for container 2ee154e2-3cc4-420a-99fb-065e740f3091 2019-09-09 05:04:26: I0909 
05:04:26.386376 89977 memory.cpp:590] Started listening on 'critical' memory pressure events for container 2ee154e2-3cc4-420a-99fb-065e740f3091 2019-09-09 05:04:26: I0909 05:04:26.386694 89921 containerizer.cpp:1131] Recovering provisioner 2019-09-09 05:04:26: I0909 05:04:26.388226 90010 metadata_manager.cpp:286] Successfully loaded 64 Docker images 2019-09-09 05:04:26: I0909 05:04:26.388420 89932 provisioner.cpp:494] Provisioner recovery complete 2019-09-09 05:04:26: I0909 05:04:26.388530 90003 containerizer.cpp:1203] Cleaning up orphan container a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97 2019-09-09 05:04:26: I0909 05:04:26.388562 90003 containerizer.cpp:2520] Destroying container a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97 in RUNNING state 2019-09-09 05:04:26: I0909 05:04:26.388576 90003 containerizer.cpp:3187]
[jira] [Created] (MESOS-9885) Resource provider configuration is only removed after its container is stopped, causing issues in failover scenarios
Jan Schlicht created MESOS-9885: --- Summary: Resource provider configuration is only removed after its container is stopped, causing issues in failover scenarios Key: MESOS-9885 URL: https://issues.apache.org/jira/browse/MESOS-9885 Project: Mesos Issue Type: Bug Components: resource provider Affects Versions: 1.8.0 Reporter: Jan Schlicht An agent could crash while it is handling a {{REMOVE_RESOURCE_PROVIDER_CONFIG}} call. In that case, the resource provider won't be removed. This is because its configuration is only removed if the actual resource provider container has been stopped. I.e. in {{LocalResourceProviderDaemonProcess::remove}}, {{os::rm}} is only called if {{cleanupContainers}} was successful. After agent failover, the resource provider will still be running. This can be a problem for frameworks/operators, because there isn't a feedback channel that informs them whether their removal request was successful or not. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
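The ordering problem described above reduces to a small control-flow sketch (hypothetical names; the real code chains the config removal onto the container cleanup): if cleanup fails, or the agent crashes before {{os::rm}} runs, the configuration file survives and the provider is relaunched after failover.

```cpp
// Sketch of the problematic ordering: config removal is gated on
// container cleanup succeeding, so a failed or interrupted cleanup
// strands the config on disk.
struct Daemon {
  explicit Daemon(bool cleanupSucceeds) : cleanupSucceeds(cleanupSucceeds) {}

  bool cleanupSucceeds;       // stand-in for cleanupContainers()'s outcome
  bool configPresent = true;  // stand-in for the config file on disk

  bool remove() {
    if (!cleanupSucceeds) {
      return false;           // cleanup failed (or agent crashed here):
                              // os::rm() analogue below never runs
    }
    configPresent = false;    // os::rm() analogue
    return true;
  }
};
```

On agent restart, any config still present is treated as an active provider, which is why the caller never learns that the removal request effectively failed.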
[jira] [Commented] (MESOS-9743) Argument forwarding in CMake build results in glog 0.4.0 built as shared library
[ https://issues.apache.org/jira/browse/MESOS-9743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825847#comment-16825847 ] Jan Schlicht commented on MESOS-9743: - cc [~asekretenko] Looks like this change was intended? In https://reviews.apache.org/r/70387/ the imported location is changed from {{glog}} to {{libglog}}, i.e. from a static to a dynamic library. In that case, it's probably related to the Ninja build system and a byproduct isn't copied. But then, building with {{BUILD_SHARED_LIBS=ON}} will cause problems, because GLog would be built as a static lib while we expect a dynamic library now.
[jira] [Created] (MESOS-9743) Argument forwarding in CMake build results in glog 0.4.0 built as shared library
Jan Schlicht created MESOS-9743: --- Summary: Argument forwarding in CMake build results in glog 0.4.0 built as shared library Key: MESOS-9743 URL: https://issues.apache.org/jira/browse/MESOS-9743 Project: Mesos Issue Type: Bug Components: cmake Affects Versions: 1.8.0 Environment: macOS 10.14.4, clang 8.0.0 Reporter: Jan Schlicht Assignee: Jan Schlicht GLog versions >= 0.3.5 introduce a {{BUILD_SHARED_LIBS}} CMake option. The CMake configuration of Mesos also has such an option. Because these options are forwarded to third-party packages, GLog will be built as a shared library if Mesos is built with {{BUILD_SHARED_LIBS=OFF}}. This is not intended, as in that case the GLog shared library is not copied over, resulting in Mesos binaries failing to start. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9594) Test `StorageLocalResourceProviderTest.RetryRpcWithExponentialBackoff` is flaky.
[ https://issues.apache.org/jira/browse/MESOS-9594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16815177#comment-16815177 ] Jan Schlicht commented on MESOS-9594: - While trying to reproduce this locally, running {noformat} stress-ng --cpu=100 --io 20 --vm 20 --fork 100 --timeout 3600s & GLOG_v=1 src/mesos-tests --verbose --gtest_filter=*RetryRpcWithExponentialBackoff --gtest_repeat=-1 --gtest_break_on_failure {noformat} this crashes in a similar manner as reported in MESOS-9712. Log: [^RetryRpcWithExponentialBackoff-segfault.txt] > Test `StorageLocalResourceProviderTest.RetryRpcWithExponentialBackoff` is > flaky. > > > Key: MESOS-9594 > URL: https://issues.apache.org/jira/browse/MESOS-9594 > Project: Mesos > Issue Type: Bug > Components: storage, test >Reporter: Chun-Hung Hsiao >Assignee: Jan Schlicht >Priority: Major > Labels: flaky-test, mesosphere, storage > Attachments: RetryRpcWithExponentialBackoff-badrun.txt, > RetryRpcWithExponentialBackoff-segfault.txt > > > Observed on ASF CI: > {noformat} > /tmp/SRC/src/tests/storage_local_resource_provider_tests.cpp:5027 > Failed to wait 1mins for offers > {noformat} > Full log: [^RetryRpcWithExponentialBackoff-badrun.txt] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9712) StorageLocalResourceProviderTest.CsiPluginRpcMetrics is flaky
Jan Schlicht created MESOS-9712: --- Summary: StorageLocalResourceProviderTest.CsiPluginRpcMetrics is flaky Key: MESOS-9712 URL: https://issues.apache.org/jira/browse/MESOS-9712 Project: Mesos Issue Type: Bug Components: storage Environment: Debian 9, Mesos configured with SSL support Reporter: Jan Schlicht >From an internal CI run: {noformat} [ RUN ] StorageLocalResourceProviderTest.CsiPluginRpcMetrics 06:56:26 I0409 06:56:26.350445 23181 cluster.cpp:176] Creating default 'local' authorizer 06:56:26 malloc_consolidate(): invalid chunk size 06:56:26 *** Aborted at 1554792986 (unix time) try "date -d @1554792986" if you are using GNU date *** 06:56:26 PC: @ 0x7f1cf4481f3b (unknown) 06:56:26 *** SIGABRT (@0x5a8d) received by PID 23181 (TID 0x7f1ce9be8700) from PID 23181; stack trace: *** 06:56:26 @ 0x7f1cf461b8e0 __GI___pthread_rwlock_rdlock 06:56:26 @ 0x7f1cf4481f3b (unknown) 06:56:26 @ 0x7f1cf44832f1 (unknown) 06:56:26 @ 0x7f1cf44c4867 (unknown) 06:56:26 @ 0x7f1cf44cae0a (unknown) 06:56:26 @ 0x7f1cf44cb10e (unknown) 06:56:26 @ 0x7f1cf44cddad (unknown) 06:56:26 @ 0x7f1cf44cf7dd (unknown) 06:56:26 @ 0x7f1cf4a647a8 (unknown) 06:56:26 @ 0x7f1cf88d0805 google::LogMessage::Init() 06:56:26 @ 0x7f1cf88d10ac google::LogMessage::LogMessage() 06:56:26 @ 0x7f1cf752a46a mesos::internal::master::Master::initialize() 06:56:26 @ 0x7f1cf882bd72 process::ProcessManager::resume() 06:56:26 @ 0x7f1cf88303c6 _ZNSt6thread11_State_implISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv 06:56:26 @ 0x7f1cf4a8ee6f (unknown) 06:56:26 @ 0x7f1cf4610f2a (unknown) 06:56:26 @ 0x7f1cf4543edf (unknown) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9594) Test `StorageLocalResourceProviderTest.RetryRpcWithExponentialBackoff` is flaky.
[ https://issues.apache.org/jira/browse/MESOS-9594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht reassigned MESOS-9594: --- Assignee: Jan Schlicht
[jira] [Assigned] (MESOS-9612) Resource provider manager assumes all operations are triggered by frameworks
[ https://issues.apache.org/jira/browse/MESOS-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht reassigned MESOS-9612: --- Assignee: Jan Schlicht > Resource provider manager assumes all operations are triggered by frameworks > > > Key: MESOS-9612 > URL: https://issues.apache.org/jira/browse/MESOS-9612 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Benjamin Bannier >Assignee: Jan Schlicht >Priority: Blocker > Labels: mesosphere, mesosphere-dss-ga, storage > > When the agent tries to apply an operation to resource provider resources, it > invokes {{ResourceProviderManager::applyOperation}} which in turn invokes > {{ResourceProviderManagerProcess::applyOperation}}. That function currently > assumes that the received message contains a valid {{FrameworkID}}, > {noformat} > void ResourceProviderManagerProcess::applyOperation( > const ApplyOperationMessage& message) > > > { > const Offer::Operation& operation = message.operation_info(); > > > > const FrameworkID& frameworkId = message.framework_id(); // > `framework_id` is `optional`. > {noformat} > Since {{FrameworkID}} is not a trivial proto type, but instead one with a > {{required}} field {{value}}, the message composed with the {{frameworkId}} > below cannot be serialized, which leads to a failure that in turn > triggers a {{CHECK}} failure in the agent's function interfacing with the > manager. > A typical scenario where we would want to support operator API calls here is > to destroy leftover persistent volumes or reservations. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9631) MasterLoadTest.SimultaneousBatchedRequests segfaults on macOS
Jan Schlicht created MESOS-9631: --- Summary: MasterLoadTest.SimultaneousBatchedRequests segfaults on macOS Key: MESOS-9631 URL: https://issues.apache.org/jira/browse/MESOS-9631 Project: Mesos Issue Type: Bug Components: test Environment: macOS Mojave 10.14.3 Reporter: Jan Schlicht Also tested on Linux, where this test succeeds. {{GLOG_v=1}} output of this test on macOS: {noformat} I0304 09:33:08.532002 155725824 master.cpp:414] Master 8be09e79-ff3b-49bf-86e9-cde00fbdcdaa (172.18.8.49) started on 172.18.8.49:56584 I0304 09:33:08.532045 155725824 master.cpp:417] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/private/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/uCWwLH/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_operator_event_stream_subscribers="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework_metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" 
--root_submissions="true" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/private/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/uCWwLH/master" --zk_session_timeout="10secs" I0304 09:33:08.532878 155725824 master.cpp:466] Master only allowing authenticated frameworks to register I0304 09:33:08.532889 155725824 master.cpp:472] Master only allowing authenticated agents to register I0304 09:33:08.532896 155725824 master.cpp:478] Master only allowing authenticated HTTP frameworks to register I0304 09:33:08.532903 155725824 credentials.hpp:37] Loading credentials for authentication from '/private/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/uCWwLH/credentials' I0304 09:33:08.533071 155725824 master.cpp:522] Using default 'crammd5' authenticator I0304 09:33:08.533094 155725824 authenticator.cpp:520] Initializing server SASL I0304 09:33:08.551656 155725824 auxprop.cpp:73] Initialized in-memory auxiliary property plugin I0304 09:33:08.551702 155725824 http.cpp:965] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I0304 09:33:08.551745 155725824 http.cpp:965] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I0304 09:33:08.551766 155725824 http.cpp:965] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I0304 09:33:08.551785 155725824 master.cpp:603] Authorization enabled I0304 09:33:08.551923 154116096 whitelist_watcher.cpp:77] No whitelist given I0304 09:33:08.551964 151969792 hierarchical.cpp:208] Initialized hierarchical allocator process I0304 09:33:08.553930 151969792 master.cpp:2103] Elected as the leading master! 
I0304 09:33:08.553966 151969792 master.cpp:1638] Recovering from registrar I0304 09:33:08.554018 153579520 registrar.cpp:339] Recovering registrar I0304 09:33:08.556378 155725824 registrar.cpp:383] Successfully fetched the registry (0B) in 2.342912ms I0304 09:33:08.556512 155725824 registrar.cpp:487] Applied 1 operations in 38854ns; attempting to update the registry I0304 09:33:08.558737 153579520 registrar.cpp:544] Successfully updated the registry in 2.206976ms I0304 09:33:08.558776 153579520 registrar.cpp:416] Successfully recovered registrar I0304 09:33:08.55 153042944 master.cpp:1752] Recovered 0 agents from the registry (136B); allowing 10mins for agents to reregister I0304 09:33:08.558929 155725824 hierarchical.cpp:248] Skipping recovery of hierarchical allocator: nothing to recover I0304 09:33:08.561846 162198976 sched.cpp:232] Version: 1.8.0 I0304 09:33:08.562060 155189248 sched.cpp:336] New master detected at master@172.18.8.49:56584 I0304 09:33:08.562099 155189248 sched.cpp:401] Authenticating with master master@172.18.8.49:56584 I0304 09:33:08.562110 155189248 sched.cpp:408] Using default CRAM-MD5 authenticatee I0304 09:33:08.562196 1541160
[jira] [Assigned] (MESOS-9521) MasterAPITest.OperationUpdatesUponAgentGone is flaky
[ https://issues.apache.org/jira/browse/MESOS-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht reassigned MESOS-9521: --- Assignee: Benno Evers > MasterAPITest.OperationUpdatesUponAgentGone is flaky > > > Key: MESOS-9521 > URL: https://issues.apache.org/jira/browse/MESOS-9521 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.8.0 > Environment: Fedora28, cmake w/ SSL >Reporter: Benjamin Bannier >Assignee: Benno Evers >Priority: Major > Labels: flaky, flaky-test > > The recently added test {{MasterAPITest.OperationUpdatesUponAgentGone}} is > flaky, e.g., > {noformat}../src/tests/api_tests.cpp:5051: Failure > Value of: resources.empty() > Actual: true > Expected: false > ../3rdparty/libprocess/src/../include/process/gmock.hpp:504: Failure > Actual function call count doesn't match EXPECT_CALL(filter->mock, filter(to, > testing::A()))... > Expected args: message matcher (32-byte object 24-00 00-00 00-00 00-00 24-00 00-00 00-00 00-00 41-63 74-75 61-6C 20-66>, > 1-byte object ) > Expected: to be called once >Actual: never called - unsatisfied and active > {noformat} > I am able to reproduce this reliable in less than 10 iterations when running > the test in repetition under additional system stress. > Even if the test does not fail it produces the following gmock warning, > {noformat} > GMOCK WARNING: > Uninteresting mock function call - returning directly. > Function call: disconnected() > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9520) IOTest.Read hangs on Windows
Jan Schlicht created MESOS-9520: --- Summary: IOTest.Read hangs on Windows Key: MESOS-9520 URL: https://issues.apache.org/jira/browse/MESOS-9520 Project: Mesos Issue Type: Bug Components: test Environment: Windows Reporter: Jan Schlicht Noticed in test runs that {{IOTest.Read}} hangs in Windows environments. Test runs need to be aborted. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9480) Master may skip processing authorization results for `LAUNCH_GROUP`.
[ https://issues.apache.org/jira/browse/MESOS-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht reassigned MESOS-9480: --- Assignee: Chun-Hung Hsiao (was: Jan Schlicht) > Master may skip processing authorization results for `LAUNCH_GROUP`. > > > Key: MESOS-9480 > URL: https://issues.apache.org/jira/browse/MESOS-9480 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.5.0, 1.5.1, 1.6.0, 1.6.1, 1.7.0 >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao >Priority: Blocker > Labels: mesosphere > > If there is a validation error for {{LAUNCH_GROUP}}, or if there are multiple > authorization errors for some of the tasks in a {{LAUNCH_GROUP}}, the master > will skip processing the remaining authorization results, which would result > in these authorization results being examined by subsequent operations > incorrectly: > https://github.com/apache/mesos/blob/3ade731d0c1772206c4afdf56318cfab6356acee/src/master/master.cpp#L5487-L5521 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9480) Master may skip processing authorization results for `LAUNCH_GROUP`.
[ https://issues.apache.org/jira/browse/MESOS-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht reassigned MESOS-9480: --- Assignee: Jan Schlicht (was: Chun-Hung Hsiao) > Master may skip processing authorization results for `LAUNCH_GROUP`. > > > Key: MESOS-9480 > URL: https://issues.apache.org/jira/browse/MESOS-9480 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.5.0, 1.5.1, 1.6.0, 1.6.1, 1.7.0 >Reporter: Chun-Hung Hsiao >Assignee: Jan Schlicht >Priority: Blocker > Labels: mesosphere > > If there is a validation error for {{LAUNCH_GROUP}}, or if there are multiple > authorization errors for some of the tasks in a {{LAUNCH_GROUP}}, the master > will skip processing the remaining authorization results, which would result > in these authorization results being examined by subsequent operations > incorrectly: > https://github.com/apache/mesos/blob/3ade731d0c1772206c4afdf56318cfab6356acee/src/master/master.cpp#L5487-L5521 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579747#comment-16579747 ] Jan Schlicht edited comment on MESOS-8568 at 8/15/18 12:19 PM: --- -No, the {{REMOVE_NESTED_CONTAINER}} shouldn't be a problem here. This particular 500 return code is actually a no-op in the containerizer. We don't need to call {{WAIT_NESTED_CONTAINER}} here.- was (Author: nfnt): No, the {{REMOVE_NESTED_CONTAINER}} shouldn't be a problem here. This particular 500 return code is actually a no-op in the containerizer. We don't need to call {{WAIT_NESTED_CONTAINER}} here. > Command checks should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER` > -- > > Key: MESOS-8568 > URL: https://issues.apache.org/jira/browse/MESOS-8568 > Project: Mesos > Issue Type: Task >Reporter: Andrei Budnik >Priority: Blocker > Labels: default-executor, health-check, mesosphere > > After successful launch of a nested container via > `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls > [waitNestedContainer > |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657] > for the container. Checker library > [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487] > `REMOVE_NESTED_CONTAINER` to remove a previous nested container before > launching a nested container for a subsequent check. Hence, > `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that > the nested container has been terminated and can be removed/cleaned up. > In case of failure, the library [doesn't > call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636] > `WAIT_NESTED_CONTAINER`. 
Despite the failure, the container might be > launched and the following attempt to remove the container without call > `WAIT_NESTED_CONTAINER` leads to errors like: > {code:java} > W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal > Server Error' (Nested container has not terminated yet) while removing the > nested container > '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125' > used for the COMMAND check for task > 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91 > {code} > The checker library should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16580929#comment-16580929 ] Jan Schlicht commented on MESOS-8568: - Scratch my older comment. {{REMOVE_NESTED_CONTAINER}} has to be called on a destroyed container, because as part of this call, the container's runtime directory will be removed. I.e., if this call isn't successful, it will leak the container's runtime directory. This is the case in the scenario above. Hence, the checker has to call {{WAIT_NESTED_CONTAINER}} to make sure that it's not calling {{REMOVE_NESTED_CONTAINER}} on a container that is currently being destroyed. > Command checks should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER` > -- > > Key: MESOS-8568 > URL: https://issues.apache.org/jira/browse/MESOS-8568 > Project: Mesos > Issue Type: Task >Reporter: Andrei Budnik >Priority: Blocker > Labels: default-executor, health-check, mesosphere > > After successful launch of a nested container via > `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls > [waitNestedContainer > |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657] > for the container. Checker library > [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487] > `REMOVE_NESTED_CONTAINER` to remove a previous nested container before > launching a nested container for a subsequent check. Hence, > `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that > the nested container has been terminated and can be removed/cleaned up. > In case of failure, the library [doesn't > call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636] > `WAIT_NESTED_CONTAINER`. 
Despite the failure, the container might be > launched and the following attempt to remove the container without call > `WAIT_NESTED_CONTAINER` leads to errors like: > {code:java} > W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal > Server Error' (Nested container has not terminated yet) while removing the > nested container > '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125' > used for the COMMAND check for task > 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91 > {code} > The checker library should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9153) Failures when isolating cgroups can leak containers
Jan Schlicht created MESOS-9153: --- Summary: Failures when isolating cgroups can leak containers Key: MESOS-9153 URL: https://issues.apache.org/jira/browse/MESOS-9153 Project: Mesos Issue Type: Bug Affects Versions: 1.5.1 Reporter: Jan Schlicht Attachments: health_check_leak.txt When the isolation of cgroups fails (e.g., if cgroup hierarchies changed, as described in [MESOS-3488|https://issues.apache.org/jira/browse/MESOS-3488]) this will lead to a leaked container. This may happen only for nested containers. The attached log is the {{VLOG(2)}} output of a nested container that's started as part of a command health check for Kafka. I've removed all log lines unrelated to this container. Also, the cgroup hierarchy has been manipulated to run into MESOS-3488. The linux launcher fails while the containerizer is in {{ISOLATING}} state. The containerizer transitions to {{DESTROYING}} and tries to clean up the isolators. The isolators ignore the cleanup requests because the container ID seems to be unknown to them. In case of the Linux Filesystem Isolator, this leads to the container directory not getting cleaned up. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579747#comment-16579747 ] Jan Schlicht commented on MESOS-8568: - No, the {{REMOVE_NESTED_CONTAINER}} shouldn't be a problem here. This particular 500 return code is actually a no-op in the containerizer. We don't need to call {{WAIT_NESTED_CONTAINER}} here. > Command checks should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER` > -- > > Key: MESOS-8568 > URL: https://issues.apache.org/jira/browse/MESOS-8568 > Project: Mesos > Issue Type: Task >Reporter: Andrei Budnik >Priority: Blocker > Labels: default-executor, health-check, mesosphere > > After successful launch of a nested container via > `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls > [waitNestedContainer > |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657] > for the container. Checker library > [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487] > `REMOVE_NESTED_CONTAINER` to remove a previous nested container before > launching a nested container for a subsequent check. Hence, > `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that > the nested container has been terminated and can be removed/cleaned up. > In case of failure, the library [doesn't > call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636] > `WAIT_NESTED_CONTAINER`. 
Despite the failure, the container might be > launched and the following attempt to remove the container without call > `WAIT_NESTED_CONTAINER` leads to errors like: > {code:java} > W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal > Server Error' (Nested container has not terminated yet) while removing the > nested container > '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125' > used for the COMMAND check for task > 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91 > {code} > The checker library should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579656#comment-16579656 ] Jan Schlicht commented on MESOS-8568: - I've linked MESOS-9131, as it's very similar: Calling {{REMOVE_NESTED_CONTAINER}} while that container is being destroyed seems to result in a race condition, though it isn't yet clear why. > Command checks should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER` > -- > > Key: MESOS-8568 > URL: https://issues.apache.org/jira/browse/MESOS-8568 > Project: Mesos > Issue Type: Task >Reporter: Andrei Budnik >Priority: Blocker > Labels: default-executor, health-check, mesosphere > > After successful launch of a nested container via > `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls > [waitNestedContainer > |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657] > for the container. Checker library > [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487] > `REMOVE_NESTED_CONTAINER` to remove a previous nested container before > launching a nested container for a subsequent check. Hence, > `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that > the nested container has been terminated and can be removed/cleaned up. > In case of failure, the library [doesn't > call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636] > `WAIT_NESTED_CONTAINER`. 
Despite the failure, the container might be > launched and the following attempt to remove the container without call > `WAIT_NESTED_CONTAINER` leads to errors like: > {code:java} > W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal > Server Error' (Nested container has not terminated yet) while removing the > nested container > '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125' > used for the COMMAND check for task > 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91 > {code} > The checker library should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9131) Health checks launching nested containers while a container is being destroyed lead to unkillable tasks
Jan Schlicht created MESOS-9131: --- Summary: Health checks launching nested containers while a container is being destroyed lead to unkillable tasks Key: MESOS-9131 URL: https://issues.apache.org/jira/browse/MESOS-9131 Project: Mesos Issue Type: Bug Components: agent Reporter: Jan Schlicht A container might get stuck in {{DESTROYING}} state if there's a command health check that starts new nested containers while its parent container is getting destroyed. Here are some logs which unrelated lines removed. The `REMOVE_NESTED_CONTAINER`/`LAUNCH_NESTED_CONTAINER_SESSION` keeps looping afterwards. {noformat} 2018-04-16 12:37:54: I0416 12:37:54.235877 3863 containerizer.cpp:2807] Container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 has exited 2018-04-16 12:37:54: I0416 12:37:54.235914 3863 containerizer.cpp:2354] Destroying container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 in RUNNING state 2018-04-16 12:37:54: I0416 12:37:54.235932 3863 containerizer.cpp:2968] Transitioning the state of container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 from RUNNING to DESTROYING 2018-04-16 12:37:54: I0416 12:37:54.236100 3852 linux_launcher.cpp:514] Asked to destroy container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.e6e01854-40a0-4da3-b458-2b4cf52bbc11 2018-04-16 12:37:54: I0416 12:37:54.237671 3852 linux_launcher.cpp:560] Using freezer to destroy cgroup mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11 2018-04-16 12:37:54: I0416 12:37:54.240327 3852 cgroups.cpp:3060] Freezing cgroup /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11 2018-04-16 12:37:54: I0416 12:37:54.244179 3852 cgroups.cpp:1415] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11 after 3.814144ms 2018-04-16 12:37:54: I0416 12:37:54.250550 3853 cgroups.cpp:3078] Thawing cgroup /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11 2018-04-16 12:37:54: I0416 12:37:54.256599 3853 cgroups.cpp:1444] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11 after 5.977856ms ... 2018-04-16 12:37:54: I0416 12:37:54.371117 3837 http.cpp:3502] Processing LAUNCH_NESTED_CONTAINER_SESSION call for container 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd' 2018-04-16 12:37:54: W0416 12:37:54.371692 3842 http.cpp:2758] Failed to launch container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd: Parent container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 is in 'DESTROYING' state 2018-04-16 12:37:54: W0416 12:37:54.371826 3840 containerizer.cpp:2337] Attempted to destroy unknown container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd ... 2018-04-16 12:37:55: I0416 12:37:55.504456 3856 http.cpp:3078] Processing REMOVE_NESTED_CONTAINER call for container 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-f3a1238c-7f0f-4db3-bda4-c0ea951d46b6' ... 2018-04-16 12:37:55: I0416 12:37:55.556367 3857 http.cpp:3502] Processing LAUNCH_NESTED_CONTAINER_SESSION call for container 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211' ... 
2018-04-16 12:37:55: W0416 12:37:55.582137 3850 http.cpp:2758] Failed to launch container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211: Parent container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 is in 'DESTROYING' state ... 2018-04-16 12:37:55: W0416 12:37:55.583330 3844 containerizer.cpp:2337] Attempted to destroy unknown container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211 ... {noformat} This stops when the framework reconciles and instructs Mesos to kill the task. Which also results in a {noformat} 2018-04-16 13:06:04: I0416 13:06:04.161623 3843 http.cpp:2966] Processing KILL_NESTED_CONTAINER call for container 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133' {noformat} Nothing else related to this contai
[jira] [Assigned] (MESOS-9094) On macOS libprocess_tests fail to link when compiling with gRPC
[ https://issues.apache.org/jira/browse/MESOS-9094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht reassigned MESOS-9094: --- Assignee: Jan Schlicht > On macOS libprocess_tests fail to link when compiling with gRPC > --- > > Key: MESOS-9094 > URL: https://issues.apache.org/jira/browse/MESOS-9094 > Project: Mesos > Issue Type: Bug > Environment: macOS 10.13.6 with clang 6.0.1. >Reporter: Jan Schlicht >Assignee: Jan Schlicht >Priority: Major > Fix For: 1.7.0 > > > Seems like this was introduces with commit > {{a211b4cadf289168464fc50987255d883c226e89}}. Linking {{libprocess-tests}} on > macOS with enabled gRPC fails with > {noformat} > Undefined symbols for architecture x86_64: > > "grpc::TimePoint std::__1::chrono::duration > > > >::you_need_a_specialization_of_TimePoint()", referenced from: > process::Future > > process::grpc::client::Runtime::call, > std::__1::default_delete > > > (tests::PingPong::Stub::*)(grpc::ClientContext*, tests::Ping const&, > grpc::CompletionQueue*), tests::Ping, tests::Pong, > 0>(process::grpc::client::Connection const&, > std::__1::unique_ptr, > std::__1::default_delete > > > (tests::PingPong::Stub::*&&)(grpc::ClientContext*, tests::Ping const&, > grpc::CompletionQueue*), tests::Ping&&, process::grpc::client::CallOptions > const&)::'lambda'(tests::Ping const&, bool, > grpc::CompletionQueue*)::operator()(tests::Ping const&, bool, > grpc::CompletionQueue*) const in libprocess_tests-grpc_tests.o > ld: symbol(s) not found for architecture x86_64 > clang-6.0: error: linker command failed with exit code 1 (use -v to see > invocation) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9094) On macOS libprocess_tests fail to link when compiling with gRPC
[ https://issues.apache.org/jira/browse/MESOS-9094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548922#comment-16548922 ] Jan Schlicht commented on MESOS-9094: - cc [~chhsia0]. Found https://grpc.io/grpc/cpp/classgrpc_1_1_time_point.html which seems to be related. > On macOS libprocess_tests fail to link when compiling with gRPC > --- > > Key: MESOS-9094 > URL: https://issues.apache.org/jira/browse/MESOS-9094 > Project: Mesos > Issue Type: Bug > Environment: macOS 10.13.6 with clang 6.0.1. >Reporter: Jan Schlicht >Priority: Major > Fix For: 1.7.0 > > > Seems like this was introduces with commit > {{a211b4cadf289168464fc50987255d883c226e89}}. Linking {{libprocess-tests}} on > macOS with enabled gRPC fails with > {noformat} > Undefined symbols for architecture x86_64: > > "grpc::TimePoint std::__1::chrono::duration > > > >::you_need_a_specialization_of_TimePoint()", referenced from: > process::Future > > process::grpc::client::Runtime::call, > std::__1::default_delete > > > (tests::PingPong::Stub::*)(grpc::ClientContext*, tests::Ping const&, > grpc::CompletionQueue*), tests::Ping, tests::Pong, > 0>(process::grpc::client::Connection const&, > std::__1::unique_ptr, > std::__1::default_delete > > > (tests::PingPong::Stub::*&&)(grpc::ClientContext*, tests::Ping const&, > grpc::CompletionQueue*), tests::Ping&&, process::grpc::client::CallOptions > const&)::'lambda'(tests::Ping const&, bool, > grpc::CompletionQueue*)::operator()(tests::Ping const&, bool, > grpc::CompletionQueue*) const in libprocess_tests-grpc_tests.o > ld: symbol(s) not found for architecture x86_64 > clang-6.0: error: linker command failed with exit code 1 (use -v to see > invocation) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9094) On macOS libprocess_tests fail to link when compiling with gRPC
Jan Schlicht created MESOS-9094: --- Summary: On macOS libprocess_tests fail to link when compiling with gRPC Key: MESOS-9094 URL: https://issues.apache.org/jira/browse/MESOS-9094 Project: Mesos Issue Type: Bug Environment: macOS 10.13.6 with clang 6.0.1. Reporter: Jan Schlicht Fix For: 1.7.0 Seems like this was introduces with commit {{a211b4cadf289168464fc50987255d883c226e89}}. Linking {{libprocess-tests}} on macOS with enabled gRPC fails with {noformat} Undefined symbols for architecture x86_64: "grpc::TimePoint > > >::you_need_a_specialization_of_TimePoint()", referenced from: process::Future > process::grpc::client::Runtime::call, std::__1::default_delete > > (tests::PingPong::Stub::*)(grpc::ClientContext*, tests::Ping const&, grpc::CompletionQueue*), tests::Ping, tests::Pong, 0>(process::grpc::client::Connection const&, std::__1::unique_ptr, std::__1::default_delete > > (tests::PingPong::Stub::*&&)(grpc::ClientContext*, tests::Ping const&, grpc::CompletionQueue*), tests::Ping&&, process::grpc::client::CallOptions const&)::'lambda'(tests::Ping const&, bool, grpc::CompletionQueue*)::operator()(tests::Ping const&, bool, grpc::CompletionQueue*) const in libprocess_tests-grpc_tests.o ld: symbol(s) not found for architecture x86_64 clang-6.0: error: linker command failed with exit code 1 (use -v to see invocation) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7441) RegisterSlaveValidationTest.DropInvalidRegistration is flaky
[ https://issues.apache.org/jira/browse/MESOS-7441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16531286#comment-16531286 ] Jan Schlicht commented on MESOS-7441: - Reopened, as there was a recent test run (on {{master}}, SHA {{b50f6c8a}}) failing on CentOS 6 with {noformat} [ RUN ] RegisterSlaveValidationTest.DropInvalidRegistration I0703 11:44:46.746553 16172 cluster.cpp:173] Creating default 'local' authorizer I0703 11:44:46.747535 16196 master.cpp:463] Master cce3860c-7d4f-4996-b865-fc8ce8302705 (ip-172-16-10-44.ec2.internal) started on 172.16.10.44:33909 I0703 11:44:46.747611 16196 master.cpp:466] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/dwPsJP/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/dwPsJP/master" --zk_session_timeout="10secs" I0703 11:44:46.747733 16196 master.cpp:515] Master only allowing authenticated frameworks to register I0703 11:44:46.747748 16196 master.cpp:521] Master only allowing authenticated agents to register I0703 11:44:46.747754 16196 master.cpp:527] Master only allowing authenticated HTTP frameworks to register I0703 11:44:46.747761 16196 credentials.hpp:37] Loading credentials for authentication from '/tmp/dwPsJP/credentials' I0703 11:44:46.747872 16196 master.cpp:571] Using default 'crammd5' authenticator I0703 11:44:46.747907 16196 http.cpp:959] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I0703 11:44:46.747944 16196 http.cpp:959] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I0703 11:44:46.747967 16196 http.cpp:959] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I0703 11:44:46.747997 16196 master.cpp:652] Authorization enabled I0703 11:44:46.748157 16194 hierarchical.cpp:177] Initialized hierarchical allocator process I0703 11:44:46.748183 16194 whitelist_watcher.cpp:77] No whitelist given I0703 11:44:46.748715 16196 master.cpp:2162] Elected as the leading master! 
I0703 11:44:46.748736 16196 master.cpp:1717] Recovering from registrar I0703 11:44:46.748950 16196 registrar.cpp:339] Recovering registrar I0703 11:44:46.749035 16196 registrar.cpp:383] Successfully fetched the registry (0B) in 68864ns I0703 11:44:46.749059 16196 registrar.cpp:487] Applied 1 operations in 5058ns; attempting to update the registry I0703 11:44:46.749349 16196 registrar.cpp:544] Successfully updated the registry in 275968ns I0703 11:44:46.749385 16196 registrar.cpp:416] Successfully recovered registrar I0703 11:44:46.749465 16196 master.cpp:1831] Recovered 0 agents from the registry (172B); allowing 10mins for agents to reregister I0703 11:44:46.749589 16196 hierarchical.cpp:215] Skipping recovery of hierarchical allocator: nothing to recover W0703 11:44:46.751214 16172 process.cpp:2824] Attempted to spawn already running process files@172.16.10.44:33909 I0703 11:44:46.751505 16172 containerizer.cpp:300] Using isolation { environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni } I0703 11:44:46.753739 16172 linux_launcher.cpp:146] Using /cgroup/freezer as the freezer hierarchy for the Linux launcher I0703 11:44:46.754091 16172 provisioner.cpp:298] Using default backend 'copy' I0703 11:44:46.754447 16172 cluster.cpp:479] Creating default 'local' authorizer I0703 11:44:46.754907 16195 slave.cpp:268] Mesos agent started on (361)@172.16.10.44:33909 I0703 11:44:46.754920 16195 slave.cpp:269] Flags at startup: --acls="" --appc_simple_discovery_uri_prefix="http://"; --appc_store_dir="/tmp/RegisterSlaveValidationTest_DropInvalidRegistration_W7jYUL/store/appc" --authenticate_http_executors="true" --authenticate_http_readonly="true"
[jira] [Created] (MESOS-9045) LogZooKeeperTest.WriteRead can segfault
Jan Schlicht created MESOS-9045: --- Summary: LogZooKeeperTest.WriteRead can segfault Key: MESOS-9045 URL: https://issues.apache.org/jira/browse/MESOS-9045 Project: Mesos Issue Type: Bug Affects Versions: 1.5.1 Environment: macOS Reporter: Jan Schlicht The following segfault occurred when testing the {{1.5.x}} branch (SHA {{64341865d}}) on macOS: {noformat} [ RUN ] LogZooKeeperTest.WriteRead I0702 00:49:46.259831 2560127808 jvm.cpp:590] Looking up method (Ljava/lang/String;)V I0702 00:49:46.260002 2560127808 jvm.cpp:590] Looking up method deleteOnExit()V I0702 00:49:46.260550 2560127808 jvm.cpp:590] Looking up method (Ljava/io/File;Ljava/io/File;)V log4j:WARN No appenders could be found for logger (org.apache.zookeeper.server.persistence.FileTxnSnapLog). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. I0702 00:49:46.305560 2560127808 jvm.cpp:590] Looking up method ()V I0702 00:49:46.306149 2560127808 jvm.cpp:590] Looking up method (Lorg/apache/zookeeper/server/persistence/FileTxnSnapLog;Lorg/apache/zookeeper/server/ZooKeeperServer$DataTreeBuilder;)V I0702 00:49:46.07 2560127808 jvm.cpp:590] Looking up method ()V I0702 00:49:46.343977 2560127808 jvm.cpp:590] Looking up method (I)V I0702 00:49:46.344200 2560127808 jvm.cpp:590] Looking up method configure(Ljava/net/InetSocketAddress;I)V I0702 00:49:46.357642 2560127808 jvm.cpp:590] Looking up method startup(Lorg/apache/zookeeper/server/ZooKeeperServer;)V I0702 00:49:46.437831 2560127808 jvm.cpp:590] Looking up method getClientPort()I I0702 00:49:46.437893 2560127808 zookeeper_test_server.cpp:156] Started ZooKeeperTestServer on port 54057 I0702 00:49:46.438153 2560127808 log_tests.cpp:2468] Using temporary directory '/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/LogZooKeeperTest_WriteRead_AKZArL' I0702 00:49:46.440680 2560127808 leveldb.cpp:174] Opened db in 2.415822ms I0702 00:49:46.441301 2560127808 leveldb.cpp:181] Compacted db 
in 584251ns I0702 00:49:46.441349 2560127808 leveldb.cpp:196] Created db iterator in 20482ns I0702 00:49:46.441380 2560127808 leveldb.cpp:202] Seeked to beginning of db in 14577ns I0702 00:49:46.441407 2560127808 leveldb.cpp:277] Iterated through 0 keys in the db in 16622ns I0702 00:49:46.441447 2560127808 replica.cpp:795] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned I0702 00:49:46.441737 207974400 leveldb.cpp:310] Persisting metadata (8 bytes) to leveldb took 157037ns I0702 00:49:46.441764 207974400 replica.cpp:322] Persisted replica status to VOTING I0702 00:49:46.443361 2560127808 leveldb.cpp:174] Opened db in 1.305425ms I0702 00:49:46.443821 2560127808 leveldb.cpp:181] Compacted db in 448477ns I0702 00:49:46.443871 2560127808 leveldb.cpp:196] Created db iterator in 12681ns I0702 00:49:46.443889 2560127808 leveldb.cpp:202] Seeked to beginning of db in 13291ns I0702 00:49:46.443914 2560127808 leveldb.cpp:277] Iterated through 0 keys in the db in 14460ns I0702 00:49:46.443944 2560127808 replica.cpp:795] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned I0702 00:49:46.444277 206901248 leveldb.cpp:310] Persisting metadata (8 bytes) to leveldb took 234740ns I0702 00:49:46.444317 206901248 replica.cpp:322] Persisted replica status to VOTING I0702 00:49:46.445854 2560127808 leveldb.cpp:174] Opened db in 1.253613ms I0702 00:49:46.446967 2560127808 leveldb.cpp:181] Compacted db in 1.096521ms I0702 00:49:46.447022 2560127808 leveldb.cpp:196] Created db iterator in 14312ns I0702 00:49:46.447048 2560127808 leveldb.cpp:202] Seeked to beginning of db in 16620ns I0702 00:49:46.447077 2560127808 leveldb.cpp:277] Iterated through 1 keys in the db in 21267ns I0702 00:49:46.447113 2560127808 replica.cpp:795] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned 2018-07-02 00:49:46,447:85946(0x7c6da000):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.8 2018-07-02 
00:49:46,447:85946(0x7c6da000):ZOO_INFO@log_env@757: Client environment:host.name=Jenkinss-Mac-mini.local 2018-07-02 00:49:46,447:85946(0x7c657000):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.8 2018-07-02 00:49:46,447:85946(0x7c657000):ZOO_INFO@log_env@757: Client environment:host.name=Jenkinss-Mac-mini.local 2018-07-02 00:49:46,447:85946(0x7c6da000):ZOO_INFO@log_env@764: Client environment:os.name=Darwin 2018-07-02 00:49:46,447:85946(0x7c6da000):ZOO_INFO@log_env@765: Client environment:os.arch=17.4.0 2018-07-02 00:49:46,447:85946(0x7c657000):ZOO_INFO@log_env@764: Client environment:os.name=Darwin I0702 00:49:46.447453 206901248 log.cpp:108] Attempting to join replica to ZooKeeper group 2018-07-02 00:49:46,447:85946(0x7c6da000):ZOO_INFO@log_env@766: Client envi
[jira] [Created] (MESOS-9044) DefaultExecutorTest.ROOT_ContainerStatusForTask can segfault
Jan Schlicht created MESOS-9044: --- Summary: DefaultExecutorTest.ROOT_ContainerStatusForTask can segfault Key: MESOS-9044 URL: https://issues.apache.org/jira/browse/MESOS-9044 Project: Mesos Issue Type: Bug Components: test Affects Versions: 1.5.1 Environment: Ubuntu 16.04 Reporter: Jan Schlicht The following segfault occurred when testing the {{1.5.x}} branch (SHA {{64341865d}}) on Ubuntu 16.04: {noformat} [ RUN ] MesosContainerizer/DefaultExecutorTest.ROOT_ContainerStatusForTask/0 I0702 08:32:25.241318 17172 cluster.cpp:172] Creating default 'local' authorizer I0702 08:32:25.242328 6510 master.cpp:457] Master be25b90e-f63d-4935-aaf3-cacfc7faacbf (ip-172-16-10-86.ec2.internal) started on 172.16.10.86:32891 I0702 08:32:25.242413 6510 master.cpp:459] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/I9TI6h/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/I9TI6h/master" --zk_session_timeout="10secs" I0702 08:32:25.242554 6510 master.cpp:508] Master only allowing authenticated frameworks to register I0702 08:32:25.242564 6510 master.cpp:514] Master only allowing authenticated agents to register I0702 08:32:25.242570 6510 master.cpp:520] Master only allowing authenticated HTTP frameworks to register I0702 08:32:25.242575 6510 credentials.hpp:37] Loading credentials for authentication from '/tmp/I9TI6h/credentials' I0702 08:32:25.242677 6510 master.cpp:564] Using default 'crammd5' authenticator I0702 08:32:25.242728 6510 http.cpp:1045] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I0702 08:32:25.242780 6510 http.cpp:1045] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I0702 08:32:25.242830 6510 http.cpp:1045] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I0702 08:32:25.242864 6510 master.cpp:643] Authorization enabled I0702 08:32:25.243048 6507 hierarchical.cpp:175] Initialized hierarchical allocator process I0702 08:32:25.243223 6507 whitelist_watcher.cpp:77] No whitelist given I0702 08:32:25.243743 6510 master.cpp:2210] Elected as the leading master! 
I0702 08:32:25.243768 6510 master.cpp:1690] Recovering from registrar I0702 08:32:25.243832 6511 registrar.cpp:347] Recovering registrar I0702 08:32:25.244055 6511 registrar.cpp:391] Successfully fetched the registry (0B) in 124928ns I0702 08:32:25.244096 6511 registrar.cpp:495] Applied 1 operations in 8690ns; attempting to update the registry I0702 08:32:25.244261 6511 registrar.cpp:552] Successfully updated the registry in 146944ns I0702 08:32:25.244302 6511 registrar.cpp:424] Successfully recovered registrar I0702 08:32:25.244416 6511 master.cpp:1803] Recovered 0 agents from the registry (172B); allowing 10mins for agents to re-register I0702 08:32:25.244556 6505 hierarchical.cpp:213] Skipping recovery of hierarchical allocator: nothing to recover W0702 08:32:25.246150 17172 process.cpp:2759] Attempted to spawn already running process files@172.16.10.86:32891 I0702 08:32:25.246560 17172 containerizer.cpp:304] Using isolation { environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni } I0702 08:32:25.250222 17172 linux_launcher.cpp:146] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher I0702 08:32:25.250689 17172 provisioner.cpp:299] Using default backend 'overlay' I0702 08:32:25.251200 17172 cluster.cpp:460] Creating default 'local' authorizer I0702 08:32:25.251788 6509 slave.cpp:262] Mesos agent started on (996)@172.16.10.86:32891 I0702 08:32:25.251878 6509 slave.cpp:263] Flags at startup: --acls="" --appc_simple_discovery_uri_prefix="http://"; --appc_store_dir="/tmp/
[jira] [Commented] (MESOS-8985) Posting to the operator api with 'accept recordio' header can crash the agent
[ https://issues.apache.org/jira/browse/MESOS-8985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16507953#comment-16507953 ] Jan Schlicht commented on MESOS-8985: - This is caused by {{Content-Type}} being (in Mesos terms) a non-streaming type, while {{Accept}} indicates a streaming type. The current code doesn't cover this case, makes some wrong assumptions, and finally erroneously tries to serialize to RecordIO, which isn't supported. > Posting to the operator api with 'accept recordio' header can crash the agent > - > > Key: MESOS-8985 > URL: https://issues.apache.org/jira/browse/MESOS-8985 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.4.1, 1.5.1 >Reporter: Philip Norman >Assignee: Gilbert Song >Priority: Major > Attachments: mesos-slave-crash.log > > > It's possible to crash the mesos agent by posting a reasonable request to the > operator API. > h3. Background: > Sending a request to the v1 api endpoint with an unsupported 'accept' header: > {code:java} > curl -X POST http://10.0.3.27:5051/api/v1 \ > -H 'accept: application/atom+xml' \ > -H 'content-type: application/json' \ > -d '{"type":"GET_CONTAINERS","get_containers":{"show_nested": > true,"show_standalone": true}}'{code} > Results in the following friendly error message: > {code:java} > Expecting 'Accept' to allow application/json or application/x-protobuf or > application/recordio{code} > h3. Reproducible crash: > However, sending the same request with 'application/recordio' 'accept' header: > {code:java} > curl -X POST \ > http://10.0.3.27:5051/api/v1 \ > -H 'accept: application/recordio' \ > -H 'content-type: application/json' \ > -d '{"type":"GET_CONTAINERS","get_containers":{"show_nested": > true,"show_standalone": true}}'{code} > causes the agent to crash (no response is received). 
> Crash log is shown below, full log from the agent is attached here: > {code:java} > Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: > I0607 22:30:32.397320 3743 logfmt.cpp:178] type=audit timestamp=2018-06-07 > 22:30:32.397243904+00:00 reason="Error in token 'Missing 'Authorization' > header from HTTP request'. Allowing anonymous connection" > object="/slave(1)/api/v1" agent="Mozilla/5.0 (Macintosh; Intel Mac OS X > 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 > Safari/537.36" authorizer="mesos-agent" action="POST" result=allow > srcip=10.0.6.99 dstport=5051 srcport=42084 dstip=10.0.3.27 > Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: > W0607 22:30:32.397434 3743 authenticator.cpp:289] Error in token on request > from '10.0.6.99:42084': Missing 'Authorization' header from HTTP request > Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: > W0607 22:30:32.397466 3743 authenticator.cpp:291] Falling back to anonymous > connection using user 'dcos_anonymous' > Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: > I0607 22:30:32.397629 3748 http.cpp:1099] HTTP POST for /slave(1)/api/v1 from > 10.0.6.99:42084 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X > 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 > Safari/537.36' > Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: > I0607 22:30:32.397784 3748 http.cpp:2030] Processing GET_CONTAINERS call > Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: > F0607 22:30:32.398736 3747 http.cpp:121] Serializing a RecordIO stream is not > supported > Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: > *** Check failure stack trace: *** > Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: @ > 0x7f619478636d google::LogMessage::Fail() > Jun 07 22:30:32 
ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: @ > 0x7f619478819d google::LogMessage::SendToLog() > Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: @ > 0x7f6194785f5c google::LogMessage::Flush() > Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: @ > 0x7f6194788a99 google::LogMessageFatal::~LogMessageFatal() > Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: @ > 0x7f61935e2b9d mesos::internal::serialize() > Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: @ > 0x7f6193a4c0ef > _ZNO6lambda12CallableOnceIFN7process6FutureINS1_4http8ResponseEEERKN4JSON5ArrayEEE10CallableFnIZNK5mesos8internal5slave4Http13getContainersERKNSD_5agent4CallENSD_11ContentTypeERK6OptionINS3_14authentication9PrincipalEEEUlRKNS2_IS7_EEE0_EclES9_ > Jun 07 22:30:32 ip-10-0-3-27.us-west-2.compute.internal mesos-agent[3718]: @ > 0x7f6193a
[jira] [Assigned] (MESOS-7329) Authorize offer operations for converting disk resources
[ https://issues.apache.org/jira/browse/MESOS-7329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht reassigned MESOS-7329: --- Assignee: Jan Schlicht > Authorize offer operations for converting disk resources > > > Key: MESOS-7329 > URL: https://issues.apache.org/jira/browse/MESOS-7329 > Project: Mesos > Issue Type: Task > Components: master, security >Reporter: Jan Schlicht >Assignee: Jan Schlicht >Priority: Major > Labels: csi-post-mvp, mesosphere, security, storage > > All offer operations are authorized, hence authorization logic has to be > added to new offer operations as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8896) 'ZooKeeperMasterContenderDetectorTest.NonRetryableFrrors' is flaky
Jan Schlicht created MESOS-8896: --- Summary: 'ZooKeeperMasterContenderDetectorTest.NonRetryableFrrors' is flaky Key: MESOS-8896 URL: https://issues.apache.org/jira/browse/MESOS-8896 Project: Mesos Issue Type: Bug Components: flaky Reporter: Jan Schlicht This was a test failure on macOS with SSL enabled. Not sure yet if other systems might be affected as well: {noformat} [ RUN ] ZooKeeperMasterContenderDetectorTest.NonRetryableFrrors I0509 01:36:35.181434 2992141120 zookeeper_test_server.cpp:156] Started ZooKeeperTestServer on port 58450 2018-05-09 01:36:35,181:44641(0x79f15000):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.8 2018-05-09 01:36:35,181:44641(0x79f15000):ZOO_INFO@log_env@757: Client environment:host.name=Jenkinss-Mac-mini.local 2018-05-09 01:36:35,181:44641(0x79f15000):ZOO_INFO@log_env@764: Client environment:os.name=Darwin 2018-05-09 01:36:35,181:44641(0x79f15000):ZOO_INFO@log_env@765: Client environment:os.arch=17.4.0 2018-05-09 01:36:35,181:44641(0x79f15000):ZOO_INFO@log_env@766: Client environment:os.version=Darwin Kernel Version 17.4.0: Sun Dec 17 09:19:54 PST 2017; root:xnu-4570.41.2~1/RELEASE_X86_64 2018-05-09 01:36:35,181:44641(0x79f15000):ZOO_INFO@log_env@774: Client environment:user.name=jenkins 2018-05-09 01:36:35,181:44641(0x79f15000):ZOO_INFO@log_env@782: Client environment:user.home=/Users/jenkins 2018-05-09 01:36:35,181:44641(0x79f15000):ZOO_INFO@log_env@794: Client environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build 2018-05-09 01:36:35,181:44641(0x79f15000):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=127.0.0.1:58450 sessionTimeout=1 watcher=0x1148b6680 sessionId=0 sessionPasswd= context=0x7fe697de7590 flags=0 2018-05-09 01:36:35,182:44641(0x7aa42000):ZOO_INFO@check_events@1764: initiated connection to server [127.0.0.1:58450] 2018-05-09 01:36:35,185:44641(0x7aa42000):ZOO_INFO@check_events@1811: session establishment complete 
on server [127.0.0.1:58450], sessionId=0x163440b82ec, negotiated timeout=1 I0509 01:36:35.186167 167882752 group.cpp:341] Group process (zookeeper-group(14)@10.0.49.4:57595) connected to ZooKeeper I0509 01:36:35.186213 167882752 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0) I0509 01:36:35.186226 167882752 group.cpp:395] Authenticating with ZooKeeper using digest 2018-05-09 01:36:38,534:44641(0x7aa42000):ZOO_INFO@auth_completion_func@1327: Authentication scheme digest succeeded I0509 01:36:38.534493 167882752 group.cpp:419] Trying to create path '/mesos' in ZooKeeper 2018-05-09 01:36:38,540:44641(0x7a121000):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.8 2018-05-09 01:36:38,540:44641(0x7a121000):ZOO_INFO@log_env@757: Client environment:host.name=Jenkinss-Mac-mini.local 2018-05-09 01:36:38,540:44641(0x7a121000):ZOO_INFO@log_env@764: Client environment:os.name=Darwin 2018-05-09 01:36:38,540:44641(0x7a121000):ZOO_INFO@log_env@765: Client environment:os.arch=17.4.0 2018-05-09 01:36:38,540:44641(0x7a121000):ZOO_INFO@log_env@766: Client environment:os.version=Darwin Kernel Version 17.4.0: Sun Dec 17 09:19:54 PST 2017; root:xnu-4570.41.2~1/RELEASE_X86_64 2018-05-09 01:36:38,540:44641(0x7a121000):ZOO_INFO@log_env@774: Client environment:user.name=jenkins 2018-05-09 01:36:38,540:44641(0x7a121000):ZOO_INFO@log_env@782: Client environment:user.home=/Users/jenkins 2018-05-09 01:36:38,540:44641(0x7a121000):ZOO_INFO@log_env@794: Client environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build 2018-05-09 01:36:38,540:44641(0x7a121000):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=127.0.0.1:58450 sessionTimeout=1 watcher=0x1148b6680 sessionId=0 sessionPasswd= context=0x7fe6999c1fe0 flags=0 I0509 01:36:38.540652 166273024 contender.cpp:152] Joining the ZK group 2018-05-09 01:36:38,540:44641(0x7b463000):ZOO_INFO@check_events@1764: 
initiated connection to server [127.0.0.1:58450] 2018-05-09 01:36:38,542:44641(0x7b463000):ZOO_INFO@check_events@1811: session establishment complete on server [127.0.0.1:58450], sessionId=0x163440b82ec0001, negotiated timeout=1 I0509 01:36:38.542425 168955904 group.cpp:341] Group process (zookeeper-group(15)@10.0.49.4:57595) connected to ZooKeeper I0509 01:36:38.542466 168955904 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0) I0509 01:36:38.542480 168955904 group.cpp:395] Authenticating with ZooKeeper using digest 2018-05-09 01:36:50,559:44641(0x7aa42000):ZOO_WARN@zookeeper_interest@1597: Exceeded deadline by 8687ms 2018-05-09 01:36:50,559:44641(0x7aa42000):ZOO_ERR
[jira] [Created] (MESOS-8868) Some 'FsTest' test cases fail on macOS
Jan Schlicht created MESOS-8868: --- Summary: Some 'FsTest' test cases fail on macOS Key: MESOS-8868 URL: https://issues.apache.org/jira/browse/MESOS-8868 Project: Mesos Issue Type: Bug Environment: macOS 10.13.4, clang 6.0.0. Reporter: Jan Schlicht These tests fail in {{674db615971d2288ffdd1b64f2be93367e03a63d}}: {noformat} [ RUN ] FsTest.CreateDirectoryAtMaxPath ../../../3rdparty/stout/tests/os/filesystem_tests.cpp:243: Failure Value of: (os::realpath(testfile)).get() Actual: "/private/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/FlHiuR//file.txt" Expected: testfile Which is: "/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/FlHiuR//file.txt" [ FAILED ] FsTest.CreateDirectoryAtMaxPath (1 ms) [ RUN ] FsTest.CreateDirectoryLongerThanMaxPath ../../../3rdparty/stout/tests/os/filesystem_tests.cpp:267: Failure Value of: (os::realpath(testfile)).get() Actual: "/private/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/tQjz6A/87efabe7-c026-4d44-9174-7ffaffe92aea/fdf3029c-3ccb-472a-91a9-79c56a114f0a/33b71897-2b23-4546-83f1-f77132e48b86/7548fb65-fa84-4260-80ff-a4d9133e5fe3/221b923d-ddc3-473e-a19a-a18863985401/03e8e58d-80a1-40db-8091-3676c5ecba05/file.txt" Expected: testfile Which is: "/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/tQjz6A/87efabe7-c026-4d44-9174-7ffaffe92aea/fdf3029c-3ccb-472a-91a9-79c56a114f0a/33b71897-2b23-4546-83f1-f77132e48b86/7548fb65-fa84-4260-80ff-a4d9133e5fe3/221b923d-ddc3-473e-a19a-a18863985401/03e8e58d-80a1-40db-8091-3676c5ecba05/file.txt" [ FAILED ] FsTest.CreateDirectoryLongerThanMaxPath (1 ms) [ RUN ] FsTest.RealpathValidationOnOpenFile ../../../3rdparty/stout/tests/os/filesystem_tests.cpp:286: Failure Value of: (os::realpath(file)).get() Actual: "/private/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/k9wmip/b44085df-3da8-4799-9893-80ad4e007a80" Expected: file Which is: "/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/k9wmip/b44085df-3da8-4799-9893-80ad4e007a80" [ FAILED ] FsTest.RealpathValidationOnOpenFile (0 ms) {noformat} Seems like a regression introduced 
in stout changes that started with {{8b7798f31ea37077e5091d279fcf352a01577366}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8867) CMake: Bundled libevent v2.1.5-beta doesn't compile with OpenSSL 1.1.0
Jan Schlicht created MESOS-8867: --- Summary: CMake: Bundled libevent v2.1.5-beta doesn't compile with OpenSSL 1.1.0 Key: MESOS-8867 URL: https://issues.apache.org/jira/browse/MESOS-8867 Project: Mesos Issue Type: Bug Components: cmake Environment: Fedora 28 with OpenSSL 1.1.0h, {{cmake -G Ninja -D ENABLE_LIBEVENT=ON -D ENABLE_SSL=ON}} Reporter: Jan Schlicht Compiling libevent 2.1.5 beta with OpenSSL 1.1.0 fails with errors like {noformat} /home/vagrant/mesos/build/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c: In function ‘bio_bufferevent_new’: /home/vagrant/mesos/build/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:112:3: error: dereferencing pointer to incomplete type ‘BIO’ {aka ‘struct bio_st’} b->init = 0; ^~ {noformat} As this is the version currently bundled by CMake, builds with {{ENABLE_LIBEVENT=ON, ENABLE_SSL=ON}} will fail to compile. Libevent supports OpenSSL 1.1.0 beginning with v2.1.7-rc (see https://github.com/libevent/libevent/pull/397) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8866) CMake builds are missing byproduct declaration for jemalloc.
Jan Schlicht created MESOS-8866: --- Summary: CMake builds are missing byproduct declaration for jemalloc. Key: MESOS-8866 URL: https://issues.apache.org/jira/browse/MESOS-8866 Project: Mesos Issue Type: Bug Components: cmake Environment: Cmake with {{-G Ninja}} and {{-D ENABLE_JEMALLOC_ALLOCATOR=ON}}. Reporter: Jan Schlicht Assignee: Jan Schlicht The {{jemalloc}} dependency is missing a byproduct declaration in the CMake configuration. As a result, building Mesos with enabled {{jemalloc}} using CMake and Ninja will fail. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
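Unlike Make, the Ninja generator requires `ExternalProject_Add` to declare the files it produces via `BUILD_BYPRODUCTS`; otherwise targets that link against them fail. A hypothetical sketch of the fix (the target name, variables, and install paths are illustrative, not Mesos' actual CMake layout):

```cmake
# Declare the library jemalloc's build step produces so Ninja knows the
# file is generated rather than expected to pre-exist.
ExternalProject_Add(
  jemalloc-${JEMALLOC_VERSION}
  URL               ${JEMALLOC_URL}
  CONFIGURE_COMMAND <SOURCE_DIR>/configure --prefix=<INSTALL_DIR>
  BUILD_COMMAND     make
  INSTALL_COMMAND   make install
  BUILD_BYPRODUCTS  <INSTALL_DIR>/lib/libjemalloc.a)
```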
[jira] [Commented] (MESOS-7854) Authorize resource calls to provider manager api
[ https://issues.apache.org/jira/browse/MESOS-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453760#comment-16453760 ] Jan Schlicht commented on MESOS-7854: - Closing this in favor of MESOS-8774, as that ticket is more specific. > Authorize resource calls to provider manager api > > > Key: MESOS-7854 > URL: https://issues.apache.org/jira/browse/MESOS-7854 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Bannier >Priority: Critical > Labels: csi-post-mvp, mesosphere, storage > > The resource provider manager provides a function > {code} > process::Future api( > const process::http::Request& request, > const Option& principal) const; > {code} > which is exposed e.g., as an agent endpoint. > We need to add authorization to this function in order to e.g., stop rogue > callers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8774) Authenticate and authorize calls to the resource provider manager's API
[ https://issues.apache.org/jira/browse/MESOS-8774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht reassigned MESOS-8774: --- Assignee: Jan Schlicht > Authenticate and authorize calls to the resource provider manager's API > > > Key: MESOS-8774 > URL: https://issues.apache.org/jira/browse/MESOS-8774 > Project: Mesos > Issue Type: Task > Components: agent >Reporter: Benjamin Bannier >Assignee: Jan Schlicht >Priority: Major > Labels: mesosphere > > The resource provider manager is exposed via an agent endpoint against which > resource providers subscribe or perform other actions. We should authenticate > and authorize any interactions there. > Since local resource providers currently run on agents, which manage their > lifetime, it seems natural to extend the framework used for executor > authentication to resource providers as well. The agent would then generate a > secret token whenever a new resource provider is started and inject it into > the resource providers it launches. Resource providers in turn would use this > token when interacting with the manager API. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8818) VolumeSandboxPathIsolatorTest.SharedParentTypeVolume fails on macOS
[ https://issues.apache.org/jira/browse/MESOS-8818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16447800#comment-16447800 ] Jan Schlicht commented on MESOS-8818: - cc [~jpe...@apache.org] > VolumeSandboxPathIsolatorTest.SharedParentTypeVolume fails on macOS > --- > > Key: MESOS-8818 > URL: https://issues.apache.org/jira/browse/MESOS-8818 > Project: Mesos > Issue Type: Bug > Components: containerization > Environment: macOS 10.13.4 >Reporter: Jan Schlicht >Assignee: Jan Schlicht >Priority: Major > Labels: mesosphere > > This test fails on macOS with: > {noformat} > [ RUN ] VolumeSandboxPathIsolatorTest.SharedParentTypeVolume > I0423 10:55:19.624977 2767623040 containerizer.cpp:296] Using isolation { > environment_secret, filesystem/posix, volume/sandbox_path } > I0423 10:55:19.625176 2767623040 provisioner.cpp:299] Using default backend > 'copy' > ../../src/tests/containerizer/volume_sandbox_path_isolator_tests.cpp:130: > Failure > create: Unknown or unsupported isolator 'volume/sandbox_path' > [ FAILED ] VolumeSandboxPathIsolatorTest.SharedParentTypeVolume (3 ms) > {noformat} > Likely a regression introduced in commit > {{189efed864ca2455674b0790d6be4a73c820afd6}} which removed > {{volume/sandbox_path}} for POSIX. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8818) VolumeSandboxPathIsolatorTest.SharedParentTypeVolume fails on macOS
Jan Schlicht created MESOS-8818: --- Summary: VolumeSandboxPathIsolatorTest.SharedParentTypeVolume fails on macOS Key: MESOS-8818 URL: https://issues.apache.org/jira/browse/MESOS-8818 Project: Mesos Issue Type: Bug Components: containerization Environment: macOS 10.13.4 Reporter: Jan Schlicht Assignee: Jan Schlicht This test fails on macOS with: {noformat} [ RUN ] VolumeSandboxPathIsolatorTest.SharedParentTypeVolume I0423 10:55:19.624977 2767623040 containerizer.cpp:296] Using isolation { environment_secret, filesystem/posix, volume/sandbox_path } I0423 10:55:19.625176 2767623040 provisioner.cpp:299] Using default backend 'copy' ../../src/tests/containerizer/volume_sandbox_path_isolator_tests.cpp:130: Failure create: Unknown or unsupported isolator 'volume/sandbox_path' [ FAILED ] VolumeSandboxPathIsolatorTest.SharedParentTypeVolume (3 ms) {noformat} Likely a regression introduced in commit {{189efed864ca2455674b0790d6be4a73c820afd6}} which removed {{volume/sandbox_path}} for POSIX. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8521) Various IOSwitchboard related tests fail on macOS High Sierra.
[ https://issues.apache.org/jira/browse/MESOS-8521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431966#comment-16431966 ] Jan Schlicht commented on MESOS-8521: - Can also confirm that I'm no longer getting these failures on 10.13.4 using LLVM 6.0.0. > Various IOSwitchboard related tests fail on macOS High Sierra. > --- > > Key: MESOS-8521 > URL: https://issues.apache.org/jira/browse/MESOS-8521 > Project: Mesos > Issue Type: Bug > Environment: macOS 10.13.2 (17C88) > Apple LLVM version 9.0.0 (clang-900.0.39.2) >Reporter: Till Toenshoff >Priority: Major > > The problem appears to cause several switchboard tests to fail. Note that > this problem does not manifest on older Apple systems. > The failure rate on this system is 100%. > List of currently failing tests: > {noformat} > IOSwitchboardTest.ContainerAttach > IOSwitchboardTest.ContainerAttachAfterSlaveRestart > IOSwitchboardTest.OutputRedirectionWithTTY > ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/0 > ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/1 > {noformat} > This is an example using {{GLOG=v1}} verbose logging: > {noformat} > [ RUN ] IOSwitchboardTest.ContainerAttach > I0201 03:02:51.925930 2385417024 containerizer.cpp:304] Using isolation { > environment_secret, filesystem/posix, posix/cpu } > I0201 03:02:51.926230 2385417024 provisioner.cpp:299] Using default backend > 'copy' > I0201 03:02:51.927325 107409408 containerizer.cpp:674] Recovering > containerizer > I0201 03:02:51.928336 109019136 provisioner.cpp:495] Provisioner recovery > complete > I0201 03:02:51.934250 105799680 containerizer.cpp:1202] Starting container > 1b1af888-9e39-4c13-a647-ac43c0df9fad > I0201 03:02:51.936218 105799680 containerizer.cpp:1368] Checkpointed > ContainerConfig at > '/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/IOSwitchboardTest_ContainerAttach_1nkPYl/containers/1b1af888-9e39-4c13-a647-ac43c0df9fad/config' > I0201 03:02:51.936251 105799680 
containerizer.cpp:2952] Transitioning the > state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from PROVISIONING to > PREPARING > I0201 03:02:51.937369 109019136 switchboard.cpp:429] Allocated pseudo > terminal '/dev/ttys003' for container 1b1af888-9e39-4c13-a647-ac43c0df9fad > I0201 03:02:51.943632 109019136 switchboard.cpp:557] Launching > 'mesos-io-switchboard' with flags '--heartbeat_interval="30secs" > --help="false" > --socket_address="/tmp/mesos-io-switchboard-d3bcec3f-7c29-4630-b374-55fabb6034d8" > --stderr_from_fd="7" --stderr_to_fd="2" --stdin_to_fd="7" > --stdout_from_fd="7" --stdout_to_fd="1" --tty="true" > --wait_for_connection="false"' for container > 1b1af888-9e39-4c13-a647-ac43c0df9fad > I0201 03:02:51.945106 109019136 switchboard.cpp:587] Created I/O switchboard > server (pid: 83716) listening on socket file > '/tmp/mesos-io-switchboard-d3bcec3f-7c29-4630-b374-55fabb6034d8' for > container 1b1af888-9e39-4c13-a647-ac43c0df9fad > I0201 03:02:51.947762 106336256 containerizer.cpp:1844] Launching > 'mesos-containerizer' with flags '--help="false" > --launch_info="{"command":{"shell":true,"value":"sleep > 1000"},"environment":{"variables":[{"name":"MESOS_SANDBOX","type":"VALUE","value":"\/var\/folders\/_t\/rdp354gx7j5fjww270kbk6_rgn\/T\/IOSwitchboardTest_ContainerAttach_W9gDw0"}]},"task_environment":{},"tty_slave_path":"\/dev\/ttys003","working_directory":"\/var\/folders\/_t\/rdp354gx7j5fjww270kbk6_rgn\/T\/IOSwitchboardTest_ContainerAttach_W9gDw0"}" > --pipe_read="7" --pipe_write="10" > --runtime_directory="/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/IOSwitchboardTest_ContainerAttach_1nkPYl/containers/1b1af888-9e39-4c13-a647-ac43c0df9fad"' > I0201 03:02:51.949144 106336256 launcher.cpp:140] Forked child with pid > '83717' for container '1b1af888-9e39-4c13-a647-ac43c0df9fad' > I0201 03:02:51.949896 106336256 containerizer.cpp:2952] Transitioning the > state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from PREPARING to > ISOLATING > I0201 
03:02:51.951071 106336256 containerizer.cpp:2952] Transitioning the > state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from ISOLATING to > FETCHING > I0201 03:02:51.951190 108482560 fetcher.cpp:369] Starting to fetch URIs for > container: 1b1af888-9e39-4c13-a647-ac43c0df9fad, directory: > /var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/IOSwitchboardTest_ContainerAttach_W9gDw0 > I0201 03:02:51.951791 109019136 containerizer.cpp:2952] Transitioning the > state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from FETCHING to > RUNNING > I0201 03:02:52.076602 106872832 containerizer.cpp:2338] Destroying container > 1b1af888-9e39-4c13-a647-ac43c0df9fad in RUNNING state > I0201 03:02:52.076644 106872832 containerizer.cpp:2952] Transitioning the >
[jira] [Assigned] (MESOS-3858) Draft quota limits design document
[ https://issues.apache.org/jira/browse/MESOS-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht reassigned MESOS-3858: --- Assignee: (was: Jan Schlicht) > Draft quota limits design document > -- > > Key: MESOS-3858 > URL: https://issues.apache.org/jira/browse/MESOS-3858 > Project: Mesos > Issue Type: Task >Reporter: Jan Schlicht >Priority: Major > Labels: mesosphere, quota > > In the design documents for Quota > (https://docs.google.com/document/d/16iRNmziasEjVOblYp5bbkeBZ7pnjNlaIzPQqMTHQ-9I/edit#) > the proposed MVP does not include quota limits. Quota limits represent an > upper bound of resources that a role is allowed to use. The task of this > ticket is to outline a design document on how to implement quota limits when > the quota MVP is implemented. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8720) CSIClientTest segfaults on macOS.
Jan Schlicht created MESOS-8720: --- Summary: CSIClientTest segfaults on macOS. Key: MESOS-8720 URL: https://issues.apache.org/jira/browse/MESOS-8720 Project: Mesos Issue Type: Bug Components: storage Affects Versions: 1.6.0 Environment: macOS 10.13.3, LLVM 6.0.0 Reporter: Jan Schlicht This seems to be caused by the changes introduced in commit {{79c21981803dafd8a5e971b98961487a69017ce9}}. On a macOS build, configured with {{--enable-grpc}}, all test cases in {{CSIClientTest}} segfault. Running {{src/mesos-tests --gtest_filter=\*CSIClientTest\*}} results in {noformat} [ RUN ] Identity/CSIClientTest.Call/Client_GetSupportedVersions mesos-tests(57309,0x7fffa0293340) malloc: *** error for object 0x10bb63b68: pointer being freed was not allocated *** set a breakpoint in malloc_error_break to debug *** Aborted at 1521711802 (unix time) try "date -d @1521711802" if you are using GNU date *** PC: @ 0x7fff6738ce3e __pthread_kill *** SIGABRT (@0x7fff6738ce3e) received by PID 57309 (TID 0x7fffa0293340) stack trace: *** @ 0x7fff674bef5a _sigtramp @0x0 (unknown) @ 0x7fff672e9312 abort @ 0x7fff673e6866 free @0x10aec51bd grpc::CompletionQueue::CompletionQueue() @0x10b2087a4 process::grpc::client::Runtime::Data::Data() @0x107bd697d mesos::internal::tests::CSIClientTest::CSIClientTest() @0x107bd68ca testing::internal::ParameterizedTestFactory<>::CreateTest() @0x107c58158 testing::internal::HandleExceptionsInMethodIfSupported<>() @0x107c57fd8 testing::TestInfo::Run() @0x107c588c7 testing::TestCase::Run() @0x107c612b7 testing::internal::UnitTestImpl::RunAllTests() @0x107c60d58 testing::internal::HandleExceptionsInMethodIfSupported<>() @0x107c60cc8 testing::UnitTest::Run() @0x106afc83d main @ 0x7fff6723d115 start @0x2 (unknown) Abort trap: 6 {noformat} Increasing GLog verbosity doesn't provide more information. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8719) Mesos compiled with `--enable-grpc` doesn't compile on non-Linux builds
Jan Schlicht created MESOS-8719: --- Summary: Mesos compiled with `--enable-grpc` doesn't compile on non-Linux builds Key: MESOS-8719 URL: https://issues.apache.org/jira/browse/MESOS-8719 Project: Mesos Issue Type: Bug Components: storage Affects Versions: 1.6.0 Environment: macOS Reporter: Jan Schlicht Assignee: Jan Schlicht Commit {{59cca968e04dee069e0df2663733b6d6f55af0da}} added {{examples/test_csi_plugin.cpp}} to non-Linux builds that are configured using the {{--enable-grpc}} flag. As {{examples/test_csi_plugin.cpp}} includes {{fs/linux.hpp}}, it can only compile on Linux and needs to be disabled for non-Linux builds. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8677) FaulToleranceTest.ReregisterCompletedFrameworks crashes on macOS
Jan Schlicht created MESOS-8677: --- Summary: FaulToleranceTest.ReregisterCompletedFrameworks crashes on macOS Key: MESOS-8677 URL: https://issues.apache.org/jira/browse/MESOS-8677 Project: Mesos Issue Type: Bug Components: test Environment: macOS 10.13.3 with LLVM 6.0.0 as well as with Apple LLVM version 9.0.0 (clang-900.0.39.2) Reporter: Jan Schlicht Here's a {{GLOG_v=1}} run of the test: {noformat} [ RUN ] FaultToleranceTest.ReregisterCompletedFrameworks I0314 14:30:11.240077 2290090816 cluster.cpp:172] Creating default 'local' authorizer I0314 14:30:11.241261 55140352 master.cpp:463] Master 025f775d-9c75-43f6-9ee6-079a605fbf01 (Jenkinss-Mac-mini.local) started on 10.0.49.4:54648 I0314 14:30:11.241287 55140352 master.cpp:465] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authenticators="crammd5" --authorizers="local" --credentials="/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/ZyMWb1/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --root_submissions="true" --user_sorter="drf" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/ZyMWb1/master" --zk_session_timeout="10secs" I0314 14:30:11.241439 55140352 master.cpp:514] Master only allowing authenticated frameworks to register I0314 14:30:11.241447 55140352 master.cpp:520] Master only allowing authenticated agents to register I0314 14:30:11.241452 55140352 master.cpp:526] Master only allowing authenticated HTTP frameworks to register I0314 14:30:11.241461 55140352 credentials.hpp:37] Loading credentials for authentication from '/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/ZyMWb1/credentials' I0314 14:30:11.241678 55140352 master.cpp:570] Using default 'crammd5' authenticator I0314 14:30:11.241739 55140352 http.cpp:957] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I0314 14:30:11.241824 55140352 http.cpp:957] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I0314 14:30:11.241873 55140352 http.cpp:957] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I0314 14:30:11.241919 55140352 master.cpp:649] Authorization enabled I0314 14:30:11.242066 52457472 whitelist_watcher.cpp:77] No whitelist given I0314 14:30:11.242079 51920896 hierarchical.cpp:175] Initialized hierarchical allocator process I0314 14:30:11.243557 52994048 master.cpp:2119] Elected as the leading master! 
I0314 14:30:11.243574 52994048 master.cpp:1678] Recovering from registrar I0314 14:30:11.243640 51920896 registrar.cpp:347] Recovering registrar I0314 14:30:11.243852 52457472 registrar.cpp:391] Successfully fetched the registry (0B) in 190976ns I0314 14:30:11.243928 52457472 registrar.cpp:495] Applied 1 operations in 28606ns; attempting to update the registry I0314 14:30:11.244163 52457472 registrar.cpp:552] Successfully updated the registry in 194816ns I0314 14:30:11.244222 52457472 registrar.cpp:424] Successfully recovered registrar I0314 14:30:11.244408 54067200 master.cpp:1792] Recovered 0 agents from the registry (155B); allowing 10mins for agents to reregister I0314 14:30:11.23 52994048 hierarchical.cpp:213] Skipping recovery of hierarchical allocator: nothing to recover W0314 14:30:11.247259 2290090816 process.cpp:2805] Attempted to spawn already running process files@10.0.49.4:54648 I0314 14:30:11.247681 2290090816 cluster.cpp:460] Creating default 'local' authorizer I0314 14:30:11.248837 55676928 slave.cpp:265] Mesos agent started on (50)@10.0.49.4:54648 I0314 14:30:11.248865 55676928 slave.cpp:266] Flags at startup: --acls="" --appc_simple_discovery_uri_prefix="http://"; --appc_store_dir="/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/FaultToleranceTest_ReregisterCompletedFrameworks_UqvwBG/store/appc" --authenticate_http_executors="true" --
[jira] [Created] (MESOS-8610) NsTest.SupportedNamespaces fails on CentOS7
Jan Schlicht created MESOS-8610: --- Summary: NsTest.SupportedNamespaces fails on CentOS7 Key: MESOS-8610 URL: https://issues.apache.org/jira/browse/MESOS-8610 Project: Mesos Issue Type: Bug Reporter: Jan Schlicht Failed on a {{GLOG_v=1 src/mesos-tests --verbose}} run with {noformat} [ RUN ] NsTest.SupportedNamespaces ../../src/tests/containerizer/ns_tests.cpp:119: Failure Value of: (ns::supported(n)).get() Actual: false Expected: true Which is: true CLONE_NEWUSER ../../src/tests/containerizer/ns_tests.cpp:124: Failure Value of: (ns::supported(allNamespaces)).get() Actual: false Expected: true Which is: true CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUSER [ FAILED ] NsTest.SupportedNamespaces (0 ms) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
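The failing assertions above ask whether the kernel supports `CLONE_NEWUSER`; CentOS 7 kernels ship with user namespaces disabled by default, which is one plausible cause. A runtime probe can check whether the corresponding entry exists under `/proc/self/ns`. This is an illustrative helper under that assumption, not the stout `ns::supported` implementation.

```cpp
#include <string>
#include <sys/stat.h>

// Sketch: probe namespace support by checking whether the kernel
// exposes the matching /proc/self/ns entry (e.g. "user" for
// CLONE_NEWUSER, "pid" for CLONE_NEWPID). Illustrative only; the
// actual stout ns:: helpers work differently.
bool nsSupported(const std::string& name) {
  struct stat s;
  // stat() succeeds only if this namespace file is present.
  return ::stat(("/proc/self/ns/" + name).c_str(), &s) == 0;
}
```

On a CentOS 7 host, `nsSupported("user")` would be expected to return false unless user namespaces have been explicitly enabled.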
[jira] [Created] (MESOS-8603) SlaveTest.TerminalTaskContainerizerUpdateFailsWithGone and SlaveTest.TerminalTaskContainerizerUpdateFailsWithLost are flaky
Jan Schlicht created MESOS-8603: --- Summary: SlaveTest.TerminalTaskContainerizerUpdateFailsWithGone and SlaveTest.TerminalTaskContainerizerUpdateFailsWithLost are flaky Key: MESOS-8603 URL: https://issues.apache.org/jira/browse/MESOS-8603 Project: Mesos Issue Type: Bug Components: test Reporter: Jan Schlicht Attachments: TerminalTaskContainerizerUpdateFailsWithGone, TerminalTaskContainerizerUpdateFailsWithLost Both tests fail from time to time. Attached is verbose test output from the failures.
[jira] [Created] (MESOS-8593) Support credential updates in Docker config without restarting the agent
Jan Schlicht created MESOS-8593: --- Summary: Support credential updates in Docker config without restarting the agent Key: MESOS-8593 URL: https://issues.apache.org/jira/browse/MESOS-8593 Project: Mesos Issue Type: Improvement Components: containerization, docker Reporter: Jan Schlicht When using the Mesos containerizer with a private Docker repository via the {{--docker_config}} option, the repository might expire credentials after some time, forcing the user to log in again. In that case the Docker config in use will change and the agent needs to be restarted to reflect the change. Instead of restarting, the agent could reload the Docker config file every time before fetching.
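The proposed behavior, re-reading the Docker config before each fetch instead of caching it once at startup, can be sketched as below. `loadDockerConfig` is an illustrative helper, not the Mesos fetcher API.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Sketch of the proposal: read the --docker_config file fresh before
// every image fetch, so refreshed credentials are picked up without an
// agent restart. Illustrative only; not the Mesos implementation.
std::string loadDockerConfig(const std::string& path) {
  std::ifstream in(path);
  if (!in) {
    return "";  // Missing file: fall back to anonymous pulls.
  }
  std::ostringstream contents;
  contents << in.rdbuf();  // Slurp the whole JSON config.
  return contents.str();
}

// A fetcher following this scheme would call loadDockerConfig(path)
// at the start of each fetch rather than holding credentials parsed
// at agent startup.
```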
[jira] [Commented] (MESOS-8585) Agent Crashes When Ask to Start Task with Unknown User
[ https://issues.apache.org/jira/browse/MESOS-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365343#comment-16365343 ] Jan Schlicht commented on MESOS-8585: - Looks like this has been introduced in https://reviews.apache.org/r/64630/. cc [~jpe...@apache.org] > Agent Crashes When Ask to Start Task with Unknown User > -- > > Key: MESOS-8585 > URL: https://issues.apache.org/jira/browse/MESOS-8585 > Project: Mesos > Issue Type: Bug > Components: agent >Affects Versions: 1.5.0 >Reporter: Karsten >Priority: Major > Attachments: dcos-mesos-slave.service.1.gz, > dcos-mesos-slave.service.2.gz > > > The Marathon team has an integration test that tries to start a task with an > unknown user. The test expects a \{{TASK_FAILED}}. However, we see > \{{TASK_DROPPED}} instead. The agent logs seem to suggest that the agent > crashes and restarts. > > {code} > 783 2018-02-14 14:55:45: I0214 14:55:45.319974 6213 slave.cpp:2542] > Launching task 'sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6' for > framework 120721e5-96e5-4c0b-8660-d5ba2e96f05a-0001 > 784 2018-02-14 14:55:45: I0214 14:55:45.320605 6213 paths.cpp:727] > Creating sandbox > '/var/lib/mesos/slave/slaves/120721e5-96e5-4c0b-8660-d5ba2e96f05a-S3/frameworks/120721e5-96e5-4c0b-8660-d5ba2e96f05 > 784 > a-0001/executors/sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6/runs/dc99056a-1d85-427f-a34b-ac666d4acc88' > for user 'bad' > 785 2018-02-14 14:55:45: F0214 14:55:45.321131 6213 paths.cpp:735] > CHECK_SOME(mkdir): Failed to chown directory to 'bad': No such user 'bad' > Failed to create executor directory '/var/lib/mesos/slave/ > 785 > slaves/120721e5-96e5-4c0b-8660-d5ba2e96f05a-S3/frameworks/120721e5-96e5-4c0b-8660-d5ba2e96f05a-0001/executors/sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6/runs/dc99056a-1d85-427f-a34b-ac6 > 785 66d4acc88' > 786 2018-02-14 14:55:45: *** Check failure stack trace: *** > 787 2018-02-14 14:55:45: @ 0x7f72033444ad > 
google::LogMessage::Fail() > 788 2018-02-14 14:55:45: @ 0x7f72033462dd > google::LogMessage::SendToLog() > 789 2018-02-14 14:55:45: @ 0x7f720334409c > google::LogMessage::Flush() > 790 2018-02-14 14:55:45: @ 0x7f7203346bd9 > google::LogMessageFatal::~LogMessageFatal() > 791 2018-02-14 14:55:45: @ 0x56544ca378f9 > _CheckFatal::~_CheckFatal() > 792 2018-02-14 14:55:45: @ 0x7f720270f30d > mesos::internal::slave::paths::createExecutorDirectory() > 793 2018-02-14 14:55:45: @ 0x7f720273812c > mesos::internal::slave::Framework::addExecutor() > 794 2018-02-14 14:55:45: @ 0x7f7202753e35 > mesos::internal::slave::Slave::__run() > 795 2018-02-14 14:55:45: @ 0x7f7202764292 > _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNS1_6FutureISt4 > 795 > listIbSaIbRKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionINSA_8TaskInfoEERKSR_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaIS11_EESK_SN_SQ_SV_SZ_S15_EEvRKNS1_3PIDIT_EEMS1 > 795 > 7_FvT0_T1_T2_T3_T4_T5_EOT6_OT7_OT8_OT9_OT10_OT11_EUlOSI_OSL_OSO_OST_OSX_OS13_S3_E_ISI_SL_SO_ST_SX_S13_St12_PlaceholderILi1EEclEOS3_ > 796 2018-02-14 14:55:45: @ 0x7f72032a2b11 > process::ProcessBase::consume() > 797 2018-02-14 14:55:45: @ 0x7f72032b183c > process::ProcessManager::resume() > 798 2018-02-14 14:55:45: @ 0x7f72032b6da6 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > 799 2018-02-14 14:55:45: @ 0x7f72005ced73 (unknown) > 800 2018-02-14 14:55:45: @ 0x7f72000cf52c (unknown) > 801 2018-02-14 14:55:45: @ 0x7f71ffe0d1dd (unknown) > 802 2018-02-14 14:57:15: dcos-mesos-slave.service: Main process exited, > code=killed, status=6/ABRT > 803 2018-02-14 14:57:15: dcos-mesos-slave.service: Unit entered failed > state. > 804 2018-02-14 14:57:15: dcos-mesos-slave.service: Failed with result > 'signal'. > 805 2018-02-14 14:57:20: dcos-mesos-slave.service: Service hold-off time > over, scheduling restart. 
> 806 2018-02-14 14:57:20: Stopped Mesos Agent: distributed systems kernel > agent. > 807 2018-02-14 14:57:20: Starting Mesos Agent: distributed systems kernel > agent... > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
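The crash above comes from a fatal `CHECK_SOME(mkdir)` when `chown` fails for an unknown user. A defensive alternative, sketched here under the assumption that the user can be resolved before sandbox creation, is to validate the user up front and fail the task instead of aborting the agent. `userExists` is a hypothetical helper, not the actual fix.

```cpp
#include <pwd.h>
#include <string>

// Sketch: resolve the task's user before creating the sandbox, so an
// unknown user can be turned into a TASK_FAILED-style error rather
// than tripping the fatal CHECK in paths::createExecutorDirectory().
// Illustrative helper only; not the Mesos code.
bool userExists(const std::string& user) {
  // getpwnam() returns nullptr when no passwd entry matches.
  return ::getpwnam(user.c_str()) != nullptr;
}
```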
[jira] [Commented] (MESOS-8424) Test that operations are correctly reported following a master failover
[ https://issues.apache.org/jira/browse/MESOS-8424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16348677#comment-16348677 ] Jan Schlicht commented on MESOS-8424: - Only 65043 is merged, the other ones are still in review. Reopening. > Test that operations are correctly reported following a master failover > --- > > Key: MESOS-8424 > URL: https://issues.apache.org/jira/browse/MESOS-8424 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Jan Schlicht >Assignee: Jan Schlicht >Priority: Major > Labels: mesosphere > Fix For: 1.6.0 > > > As the master keeps track of operations running on a resource provider, it > needs to be updated on these operations when agents reregister after a master > failover. E.g., an operation that has finished during the failover should be > reported as finished by the master after the agent on which the resource > provider is running has reregistered. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-8524) When `UPDATE_SLAVE` messages are received, offers might not be rescinded due to a race
[ https://issues.apache.org/jira/browse/MESOS-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht updated MESOS-8524: Summary: When `UPDATE_SLAVE` messages are received, offers might not be rescinded due to a race (was: When `UPDATE_SLAVE` messages are received, offers might not be recinded due to a race ) > When `UPDATE_SLAVE` messages are received, offers might not be rescinded due > to a race > --- > > Key: MESOS-8524 > URL: https://issues.apache.org/jira/browse/MESOS-8524 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Affects Versions: 1.5.0 > Environment: Master + Agent running with enabled > {{RESOURCE_PROVIDER}} capability >Reporter: Jan Schlicht >Priority: Major > Labels: mesosphere > > When an agent with enabled {{RESOURCE_PROVIDER}} capability (re-)registers > with the master it sends an {{UPDATE_SLAVE}} after being (re-)registered. In > the master, the agent is added (back) to the allocator as soon as it's > (re-)registered, i.e., before {{UPDATE_SLAVE}} is sent. This triggers an > allocation and offers might get sent out to frameworks. When {{UPDATE_SLAVE}} > is being handled in the master, these offers have to be rescinded, as they're > based on an outdated agent state. > Internally, the allocator defers an offer callback in the master > ({{Master::offer}}). In rare cases an {{UPDATE_SLAVE}} message might arrive at > the same time and its handler in the master is called before the offer callback > (but after the actual allocation took place). In this case the (outdated) > offer is still sent to frameworks and never rescinded. 
> Here's the relevant log lines, this was discovered while working on > https://reviews.apache.org/r/65045/: > {noformat} > I0201 14:17:47.041093 242208768 hierarchical.cpp:1517] Performed allocation > for 1 agents in 704915ns > I0201 14:17:47.041738 242745344 master.cpp:7235] Received update of agent > 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 at slave(540)@172.18.8.20:60469 > (172.18.8.20) with total oversubscribed resources {} > I0201 14:17:47.042778 242745344 master.cpp:8808] Sending 1 offers to > framework 53c557e7-3161-449b-bacc-a4f8c02e78e7- (default) at > scheduler-798f476b-b099-443e-bd3b-9e7333f29672@172.18.8.20:60469 > I0201 14:17:47.043102 243281920 sched.cpp:921] Scheduler::resourceOffers took > 40444ns > I0201 14:17:47.043427 243818496 hierarchical.cpp:712] Grew agent > 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 by disk[MOUNT]:200 (total), { } > (used) > I0201 14:17:47.043643 243818496 hierarchical.cpp:669] Agent > 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 (172.18.8.20) updated with total > resources disk[MOUNT]:200; cpus:2; mem:1024; disk:1024; ports:[31000-32000] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8524) When `UPDATE_SLAVE` messages are received, offers might not be recinded due to a race
Jan Schlicht created MESOS-8524: --- Summary: When `UPDATE_SLAVE` messages are received, offers might not be recinded due to a race Key: MESOS-8524 URL: https://issues.apache.org/jira/browse/MESOS-8524 Project: Mesos Issue Type: Bug Components: allocation, master Affects Versions: 1.5.0 Environment: Master + Agent running with enabled {{RESOURCE_PROVIDER}} capability Reporter: Jan Schlicht When an agent with enabled {{RESOURCE_PROVIDER}} capability (re-)registers with the master it sends an {{UPDATE_SLAVE}} after being (re-)registered. In the master, the agent is added (back) to the allocator as soon as it's (re-)registered, i.e., before {{UPDATE_SLAVE}} is sent. This triggers an allocation and offers might get sent out to frameworks. When {{UPDATE_SLAVE}} is being handled in the master, these offers have to be rescinded, as they're based on an outdated agent state. Internally, the allocator defers an offer callback in the master ({{Master::offer}}). In rare cases an {{UPDATE_SLAVE}} message might arrive at the same time and its handler in the master is called before the offer callback (but after the actual allocation took place). In this case the (outdated) offer is still sent to frameworks and never rescinded. 
Here's the relevant log lines, this was discovered while working on https://reviews.apache.org/r/65045/: {noformat} I0201 14:17:47.041093 242208768 hierarchical.cpp:1517] Performed allocation for 1 agents in 704915ns I0201 14:17:47.041738 242745344 master.cpp:7235] Received update of agent 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 at slave(540)@172.18.8.20:60469 (172.18.8.20) with total oversubscribed resources {} I0201 14:17:47.042778 242745344 master.cpp:8808] Sending 1 offers to framework 53c557e7-3161-449b-bacc-a4f8c02e78e7- (default) at scheduler-798f476b-b099-443e-bd3b-9e7333f29672@172.18.8.20:60469 I0201 14:17:47.043102 243281920 sched.cpp:921] Scheduler::resourceOffers took 40444ns I0201 14:17:47.043427 243818496 hierarchical.cpp:712] Grew agent 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 by disk[MOUNT]:200 (total), { } (used) I0201 14:17:47.043643 243818496 hierarchical.cpp:669] Agent 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 (172.18.8.20) updated with total resources disk[MOUNT]:200; cpus:2; mem:1024; disk:1024; ports:[31000-32000] {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
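One way to reason about the race described in MESOS-8524 is to tag each offer with the agent-state version it was built from, and rescind any offer whose version is stale once `UPDATE_SLAVE` arrives. The sketch below models that bookkeeping; it is an illustration of the invariant, not the actual Mesos fix.

```cpp
#include <cstdint>
#include <vector>

// Model of the race: an offer built from agent state at version V is
// stale once UPDATE_SLAVE has moved the agent to a later version, even
// if the deferred Master::offer callback has already fired.
struct Offer {
  int id;
  uint64_t agentStateVersion;  // Agent-state version at allocation time.
};

class OfferTracker {
public:
  // Record an offer created from the current agent state.
  void addOffer(int id) { offers_.push_back({id, version_}); }

  // Called when UPDATE_SLAVE arrives: bump the agent-state version and
  // return the IDs of all offers that must now be rescinded.
  std::vector<int> onUpdateSlave() {
    ++version_;
    std::vector<int> rescinded;
    std::vector<Offer> kept;
    for (const Offer& o : offers_) {
      if (o.agentStateVersion < version_) {
        rescinded.push_back(o.id);  // Built from outdated state.
      } else {
        kept.push_back(o);
      }
    }
    offers_ = kept;
    return rescinded;
  }

private:
  uint64_t version_ = 0;
  std::vector<Offer> offers_;
};
```

Under this scheme the ordering of the deferred offer callback and the `UPDATE_SLAVE` handler no longer matters: the version comparison catches the stale offer either way.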
[jira] [Assigned] (MESOS-8490) UpdateSlaveMessageWithPendingOffers is flaky.
[ https://issues.apache.org/jira/browse/MESOS-8490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht reassigned MESOS-8490: --- Assignee: Jan Schlicht (was: Benjamin Bannier) > UpdateSlaveMessageWithPendingOffers is flaky. > - > > Key: MESOS-8490 > URL: https://issues.apache.org/jira/browse/MESOS-8490 > Project: Mesos > Issue Type: Bug > Components: test > Environment: CentOS 6 with SSL > Ubuntu 16.04 >Reporter: Alexander Rukletsov >Assignee: Jan Schlicht >Priority: Major > Labels: flaky-test > Attachments: UpdateSlaveMessageWithPendingOffers-badrun1.txt, > UpdateSlaveMessageWithPendingOffers-badrun2.txt > > > {noformat} > ../../src/tests/master_tests.cpp:8728 > Failed to wait 15secs for offers > {noformat} > Full logs attached. Log output from two failures looks different, might be an > indicator of multiple issues. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8473) Authorize `GET_OPERATIONS` calls.
Jan Schlicht created MESOS-8473: --- Summary: Authorize `GET_OPERATIONS` calls. Key: MESOS-8473 URL: https://issues.apache.org/jira/browse/MESOS-8473 Project: Mesos Issue Type: Task Components: agent, master Reporter: Jan Schlicht The {{GET_OPERATIONS}} call lists all known operations on a master or agent. Authorization has to be added to this call. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8445) Test that `UPDATE_STATE` of a resource provider doesn't have unwanted side-effects in master or agent
Jan Schlicht created MESOS-8445: --- Summary: Test that `UPDATE_STATE` of a resource provider doesn't have unwanted side-effects in master or agent Key: MESOS-8445 URL: https://issues.apache.org/jira/browse/MESOS-8445 Project: Mesos Issue Type: Task Reporter: Jan Schlicht Assignee: Jan Schlicht While we test the correct behavior of {{UPDATE_STATE}} sent by resource providers when an operation state changes or after (re-)registration, this call might also get sent independently of any such event, e.g., if resources are added to a running resource provider. Correct behavior of master and agent needs to be tested. Outstanding offers should be rescinded and internal states updated.
[jira] [Created] (MESOS-8424) Test that operations are correctly reported following a master failover
Jan Schlicht created MESOS-8424: --- Summary: Test that operations are correctly reported following a master failover Key: MESOS-8424 URL: https://issues.apache.org/jira/browse/MESOS-8424 Project: Mesos Issue Type: Task Components: master Reporter: Jan Schlicht Assignee: Jan Schlicht As the master keeps track of operations running on a resource provider, it needs to be updated on these operations when agents reregister after a master failover. E.g., an operation that has finished during the failover should be reported as finished by the master after the agent on which the resource provider is running has reregistered. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8424) Test that operations are correctly reported following a master failover
[ https://issues.apache.org/jira/browse/MESOS-8424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht updated MESOS-8424: Sprint: Mesosphere Sprint 72 Story Points: 3 > Test that operations are correctly reported following a master failover > --- > > Key: MESOS-8424 > URL: https://issues.apache.org/jira/browse/MESOS-8424 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Jan Schlicht >Assignee: Jan Schlicht > > As the master keeps track of operations running on a resource provider, it > needs to be updated on these operations when agents reregister after a master > failover. E.g., an operation that has finished during the failover should be > reported as finished by the master after the agent on which the resource > provider is running has reregistered. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-8219) Validate that any offer operation is only applied on resources from a single provider
[ https://issues.apache.org/jira/browse/MESOS-8219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307956#comment-16307956 ] Jan Schlicht commented on MESOS-8219: - Sure, will work on this. > Validate that any offer operation is only applied on resources from a single > provider > - > > Key: MESOS-8219 > URL: https://issues.apache.org/jira/browse/MESOS-8219 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Benjamin Bannier >Assignee: Jan Schlicht > > Offer operations can only be applied to resources from one single resource > provider. A number of places in the implementation assume that the provider > ID obtained from any {{Resource}} in an offer operation is equivalent to the > one from any other resource. We should update the master to validate that > invariant and reject malformed operations.
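The validation requested in MESOS-8219 amounts to checking that every resource in an operation carries the same provider ID. In this minimal sketch resources are reduced to their provider ID strings, with `""` standing in for "no resource provider"; it illustrates the invariant only and is not the Mesos validation code.

```cpp
#include <string>
#include <vector>

// Invariant to enforce: all resources in one offer operation must come
// from the same (possibly absent) resource provider. Illustrative
// only; the real check walks Resource protobufs, not strings.
bool singleProvider(const std::vector<std::string>& providerIds) {
  if (providerIds.empty()) {
    return true;  // Vacuously valid: nothing to compare.
  }
  const std::string& first = providerIds.front();
  for (const std::string& id : providerIds) {
    if (id != first) {
      return false;  // Mixed providers: reject as malformed.
    }
  }
  return true;
}
```

The master would run such a check during operation validation and reject any operation mixing provider IDs before it ever reaches the allocator.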
[jira] [Updated] (MESOS-8346) Resubscription of a resource provider will crash the agent if its HTTP connection isn't closed
[ https://issues.apache.org/jira/browse/MESOS-8346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht updated MESOS-8346: Shepherd: Benjamin Bannier > Resubscription of a resource provider will crash the agent if its HTTP > connection isn't closed > -- > > Key: MESOS-8346 > URL: https://issues.apache.org/jira/browse/MESOS-8346 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: Jan Schlicht >Assignee: Jan Schlicht >Priority: Blocker > Labels: mesosphere > > A resource provider might resubscribe while its old HTTP connection wasn't > properly closed. In that case an agent will crash with, e.g., the following > log: > {noformat} > I1219 13:33:51.937295 128610304 manager.cpp:570] Subscribing resource > provider > {"id":{"value":"8e71beef-796e-4bde-9257-952ed0f230a5"},"name":"test","type":"org.apache.mesos.rp.test"} > I1219 13:33:51.937443 128610304 manager.cpp:134] Terminating resource > provider 8e71beef-796e-4bde-9257-952ed0f230a5 > I1219 13:33:51.937760 128610304 manager.cpp:134] Terminating resource > provider 8e71beef-796e-4bde-9257-952ed0f230a5 > E1219 13:33:51.937851 129683456 http_connection.hpp:445] End-Of-File received > I1219 13:33:51.937865 131293184 slave.cpp:7105] Handling resource provider > message 'DISCONNECT: resource provider 8e71beef-796e-4bde-9257-952ed0f230a5' > I1219 13:33:51.937968 131293184 slave.cpp:7347] Forwarding new total > resources cpus:2; mem:1024; disk:1024; ports:[31000-32000] > F1219 13:33:51.938052 132366336 manager.cpp:606] Check failed: > resourceProviders.subscribed.contains(resourceProviderId) > *** Check failure stack trace: *** > E1219 13:33:51.938583 130756608 http_connection.hpp:445] End-Of-File received > I1219 13:33:51.938987 129683456 hierarchical.cpp:669] Agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 (172.18.8.13) updated with total > resources cpus:2; mem:1024; disk:1024; ports:[31000-32000] > @0x1125380ef google::LogMessageFatal::~LogMessageFatal() > @0x112534ae9 
google::LogMessageFatal::~LogMessageFatal() > I1219 13:33:51.939131 129683456 hierarchical.cpp:1517] Performed allocation > for 1 agents in 61830ns > I1219 13:33:51.945793 2646795072 slave.cpp:927] Agent terminating > I1219 13:33:51.945955 129146880 master.cpp:1305] Agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 > (172.18.8.13) disconnected > I1219 13:33:51.945979 129146880 master.cpp:3364] Disconnecting agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 > (172.18.8.13) > I1219 13:33:51.946022 129146880 master.cpp:3383] Deactivating agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 > (172.18.8.13) > I1219 13:33:51.946081 131293184 hierarchical.cpp:766] Agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 deactivated > @0x115f2761d > mesos::internal::ResourceProviderManagerProcess::subscribe()::$_2::operator()() > @0x115f2977d > _ZN5cpp176invokeIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS2_14HttpConnectionERKNS1_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSG_DpOSH_ > @0x115f29740 > _ZN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7Nothing13invoke_expandISC_NSt3__15tupleIJSG_EEENSK_IJEEEJLm0DTclsr5cpp17E6invokeclsr3stdE7forwardIT_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardIT0_Efp0_EEclsr3stdE7forwardIT1_Efp2_OSN_OSO_N5cpp1416integer_sequenceImJXspT2_OSP_ > @0x115f296bb > 
_ZNO6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingclIJEEEDTcl13invoke_expandclL_ZNSt3__14moveIRSC_EEONSJ_16remove_referenceIT_E4typeEOSN_EdtdefpT1fEclL_ZNSK_IRNSJ_5tupleIJSG_ESQ_SR_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_DpOSY_ > @0x115f2965d > _ZN5cpp176invokeIN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS5_14HttpConnectionERKNS4_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEJEEEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSK_DpOSL_ > @0x115f29631 > _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS6_14HttpConnectionERKNS5_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEJEEEvOT_DpOT0_ > @
[jira] [Commented] (MESOS-8349) When a resource provider driver is disconnected, it fails to reconnect.
[ https://issues.apache.org/jira/browse/MESOS-8349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298557#comment-16298557 ] Jan Schlicht commented on MESOS-8349: - Discarding a {{Future}} (instead of discarding its {{Promise}}) won't call {{onAny}} callbacks, only a {{onDiscarded}} callback that we haven't set up here. > When a resource provider driver is disconnected, it fails to reconnect. > --- > > Key: MESOS-8349 > URL: https://issues.apache.org/jira/browse/MESOS-8349 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: mesosphere > > If the resource provider manager closes the HTTP connection of a resource > provider, the resource provider should reconnect itself. For that, the > resource provider driver will change its state to "DISCONNECTED", call a > {{disconnected}} callback and use its endpoint detector to reconnect. > This doesn't work in a testing environment where a > {{ConstantEndpointDetector}} is used. While the resource provider is notified > of the closed HTTP connection (and logs {{End-Of-File received}}), it never > disconnects itself and calls the {{disconnected}} callback. Discarding > {{HttpConnectionProcess::detection}} in > {{HttpConnectionProcess::disconnected}} doesn't trigger the {{onAny}} > callback of that future. This might not be a problem in > {{HttpConnectionProcess}} but could be related to the test case using a > {{ConstantEndpointDetector}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8349) When a resource provider driver is disconnected, it fails to reconnect.
Jan Schlicht created MESOS-8349: --- Summary: When a resource provider driver is disconnected, it fails to reconnect. Key: MESOS-8349 URL: https://issues.apache.org/jira/browse/MESOS-8349 Project: Mesos Issue Type: Bug Affects Versions: 1.5.0 Reporter: Jan Schlicht Assignee: Jan Schlicht If the resource provider manager closes the HTTP connection of a resource provider, the resource provider should reconnect itself. For that, the resource provider driver will change its state to "DISCONNECTED", call a {{disconnected}} callback and use its endpoint detector to reconnect. This doesn't work in a testing environment where a {{ConstantEndpointDetector}} is used. While the resource provider is notified of the closed HTTP connection (and logs {{End-Of-File received}}), it never disconnects itself and calls the {{disconnected}} callback. Discarding {{HttpConnectionProcess::detection}} in {{HttpConnectionProcess::disconnected}} doesn't trigger the {{onAny}} callback of that future. This might not be a problem in {{HttpConnectionProcess}} but could be related to the test case using a {{ConstantEndpointDetector}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
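The distinction this ticket turns on can be modeled in a few lines: discarding a {{Future}} is only a *request* that the producer may honor, while discarding the {{Promise}} actually completes the future and fires its {{onAny}} callbacks. Below is a toy Python model of that behavior; the class and method names are illustrative only, since the real semantics live in libprocess C++.

```python
# Toy model (not libprocess) of the Future/Promise discard asymmetry:
# consumer-side discard() is a request and does not fire on_any handlers;
# producer-side Promise.discard() completes the future and does fire them.
class Future:
    def __init__(self):
        self.state = "PENDING"
        self._on_any = []
        self._on_discard = []

    def on_any(self, fn): self._on_any.append(fn)
    def on_discard(self, fn): self._on_discard.append(fn)

    def discard(self):
        # Consumer-side: only signal the discard *request*.
        if self.state == "PENDING":
            for fn in self._on_discard:
                fn()

class Promise:
    def __init__(self):
        self.future = Future()

    def discard(self):
        # Producer-side: actually complete the future as DISCARDED,
        # which is what makes on_any callbacks run.
        if self.future.state == "PENDING":
            self.future.state = "DISCARDED"
            for fn in self.future._on_any:
                fn(self.future)
```

Under this model, discarding {{HttpConnectionProcess::detection}} from the consumer side would leave the future pending and any {{onAny}} handler silent, matching the behavior the ticket describes.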
[jira] [Commented] (MESOS-8346) Resubscription of a resource provider will crash the agent if its HTTP connection isn't closed
[ https://issues.apache.org/jira/browse/MESOS-8346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298138#comment-16298138 ] Jan Schlicht commented on MESOS-8346: - It will land today; the patch seems good and just needs a small update. > Resubscription of a resource provider will crash the agent if its HTTP > connection isn't closed > -- > > Key: MESOS-8346 > URL: https://issues.apache.org/jira/browse/MESOS-8346 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: Jan Schlicht >Assignee: Jan Schlicht >Priority: Blocker > Labels: mesosphere > > A resource provider might resubscribe while its old HTTP connection wasn't > properly closed. In that case an agent will crash with, e.g., the following > log: > {noformat} > I1219 13:33:51.937295 128610304 manager.cpp:570] Subscribing resource > provider > {"id":{"value":"8e71beef-796e-4bde-9257-952ed0f230a5"},"name":"test","type":"org.apache.mesos.rp.test"} > I1219 13:33:51.937443 128610304 manager.cpp:134] Terminating resource > provider 8e71beef-796e-4bde-9257-952ed0f230a5 > I1219 13:33:51.937760 128610304 manager.cpp:134] Terminating resource > provider 8e71beef-796e-4bde-9257-952ed0f230a5 > E1219 13:33:51.937851 129683456 http_connection.hpp:445] End-Of-File received > I1219 13:33:51.937865 131293184 slave.cpp:7105] Handling resource provider > message 'DISCONNECT: resource provider 8e71beef-796e-4bde-9257-952ed0f230a5' > I1219 13:33:51.937968 131293184 slave.cpp:7347] Forwarding new total > resources cpus:2; mem:1024; disk:1024; ports:[31000-32000] > F1219 13:33:51.938052 132366336 manager.cpp:606] Check failed: > resourceProviders.subscribed.contains(resourceProviderId) > *** Check failure stack trace: *** > E1219 13:33:51.938583 130756608 http_connection.hpp:445] End-Of-File received > I1219 13:33:51.938987 129683456 hierarchical.cpp:669] Agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 (172.18.8.13) updated with total > resources cpus:2; mem:1024; 
disk:1024; ports:[31000-32000] > @0x1125380ef google::LogMessageFatal::~LogMessageFatal() > @0x112534ae9 google::LogMessageFatal::~LogMessageFatal() > I1219 13:33:51.939131 129683456 hierarchical.cpp:1517] Performed allocation > for 1 agents in 61830ns > I1219 13:33:51.945793 2646795072 slave.cpp:927] Agent terminating > I1219 13:33:51.945955 129146880 master.cpp:1305] Agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 > (172.18.8.13) disconnected > I1219 13:33:51.945979 129146880 master.cpp:3364] Disconnecting agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 > (172.18.8.13) > I1219 13:33:51.946022 129146880 master.cpp:3383] Deactivating agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 > (172.18.8.13) > I1219 13:33:51.946081 131293184 hierarchical.cpp:766] Agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 deactivated > @0x115f2761d > mesos::internal::ResourceProviderManagerProcess::subscribe()::$_2::operator()() > @0x115f2977d > _ZN5cpp176invokeIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS2_14HttpConnectionERKNS1_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSG_DpOSH_ > @0x115f29740 > _ZN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7Nothing13invoke_expandISC_NSt3__15tupleIJSG_EEENSK_IJEEEJLm0DTclsr5cpp17E6invokeclsr3stdE7forwardIT_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardIT0_Efp0_EEclsr3stdE7forwardIT1_Efp2_OSN_OSO_N5cpp1416integer_sequenceImJXspT2_OSP_ > @0x115f296bb > 
_ZNO6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingclIJEEEDTcl13invoke_expandclL_ZNSt3__14moveIRSC_EEONSJ_16remove_referenceIT_E4typeEOSN_EdtdefpT1fEclL_ZNSK_IRNSJ_5tupleIJSG_ESQ_SR_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_DpOSY_ > @0x115f2965d > _ZN5cpp176invokeIN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS5_14HttpConnectionERKNS4_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEJEEEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSK_DpOSL_ > @0x115f29631 > _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS6_14HttpConnectionERK
[jira] [Created] (MESOS-8346) Resubscription of a resource provider will crash the agent if its HTTP connection isn't closed
Jan Schlicht created MESOS-8346: --- Summary: Resubscription of a resource provider will crash the agent if its HTTP connection isn't closed Key: MESOS-8346 URL: https://issues.apache.org/jira/browse/MESOS-8346 Project: Mesos Issue Type: Bug Affects Versions: 1.5.0 Reporter: Jan Schlicht Assignee: Jan Schlicht Priority: Blocker A resource provider might resubscribe while its old HTTP connection wasn't properly closed. In that case an agent will crash with, e.g., the following log: {noformat} I1219 13:33:51.937295 128610304 manager.cpp:570] Subscribing resource provider {"id":{"value":"8e71beef-796e-4bde-9257-952ed0f230a5"},"name":"test","type":"org.apache.mesos.rp.test"} I1219 13:33:51.937443 128610304 manager.cpp:134] Terminating resource provider 8e71beef-796e-4bde-9257-952ed0f230a5 I1219 13:33:51.937760 128610304 manager.cpp:134] Terminating resource provider 8e71beef-796e-4bde-9257-952ed0f230a5 E1219 13:33:51.937851 129683456 http_connection.hpp:445] End-Of-File received I1219 13:33:51.937865 131293184 slave.cpp:7105] Handling resource provider message 'DISCONNECT: resource provider 8e71beef-796e-4bde-9257-952ed0f230a5' I1219 13:33:51.937968 131293184 slave.cpp:7347] Forwarding new total resources cpus:2; mem:1024; disk:1024; ports:[31000-32000] F1219 13:33:51.938052 132366336 manager.cpp:606] Check failed: resourceProviders.subscribed.contains(resourceProviderId) *** Check failure stack trace: *** E1219 13:33:51.938583 130756608 http_connection.hpp:445] End-Of-File received I1219 13:33:51.938987 129683456 hierarchical.cpp:669] Agent 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 (172.18.8.13) updated with total resources cpus:2; mem:1024; disk:1024; ports:[31000-32000] @0x1125380ef google::LogMessageFatal::~LogMessageFatal() @0x112534ae9 google::LogMessageFatal::~LogMessageFatal() I1219 13:33:51.939131 129683456 hierarchical.cpp:1517] Performed allocation for 1 agents in 61830ns I1219 13:33:51.945793 2646795072 slave.cpp:927] Agent terminating I1219 13:33:51.945955 
129146880 master.cpp:1305] Agent 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 (172.18.8.13) disconnected I1219 13:33:51.945979 129146880 master.cpp:3364] Disconnecting agent 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 (172.18.8.13) I1219 13:33:51.946022 129146880 master.cpp:3383] Deactivating agent 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 (172.18.8.13) I1219 13:33:51.946081 131293184 hierarchical.cpp:766] Agent 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 deactivated @0x115f2761d mesos::internal::ResourceProviderManagerProcess::subscribe()::$_2::operator()() @0x115f2977d _ZN5cpp176invokeIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS2_14HttpConnectionERKNS1_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSG_DpOSH_ @0x115f29740 _ZN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7Nothing13invoke_expandISC_NSt3__15tupleIJSG_EEENSK_IJEEEJLm0DTclsr5cpp17E6invokeclsr3stdE7forwardIT_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardIT0_Efp0_EEclsr3stdE7forwardIT1_Efp2_OSN_OSO_N5cpp1416integer_sequenceImJXspT2_OSP_ @0x115f296bb _ZNO6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingclIJEEEDTcl13invoke_expandclL_ZNSt3__14moveIRSC_EEONSJ_16remove_referenceIT_E4typeEOSN_EdtdefpT1fEclL_ZNSK_IRNSJ_5tupleIJSG_ESQ_SR_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_DpOSY_ @0x115f2965d 
_ZN5cpp176invokeIN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS5_14HttpConnectionERKNS4_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEJEEEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSK_DpOSL_ @0x115f29631 _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS6_14HttpConnectionERKNS5_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEJEEEvOT_DpOT0_ @0x115f29526 _ZNO6lambda12CallableOnceIFvvEE10CallableFnINS_8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS7_14HttpConnectionERKNS6_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEEclEv @0x10b6ca690 _ZNO6lambda12CallableOnceIFvvEEclEv @0x10be09295 _ZZN7process8internal8DispatchIvEclIN6lambda1
[jira] [Created] (MESOS-8315) ResourceProviderManagerHttpApiTest.ResubscribeResourceProvider is flaky
Jan Schlicht created MESOS-8315: --- Summary: ResourceProviderManagerHttpApiTest.ResubscribeResourceProvider is flaky Key: MESOS-8315 URL: https://issues.apache.org/jira/browse/MESOS-8315 Project: Mesos Issue Type: Bug Components: test Reporter: Jan Schlicht Assignee: Jan Schlicht Log from a CI run that failed: {noformat} [ RUN ] ContentType/ResourceProviderManagerHttpApiTest.ResubscribeResourceProvider/1 I1208 02:27:51.541087 4488 cluster.cpp:172] Creating default 'local' authorizer I1208 02:27:51.542224 24578 master.cpp:456] Master d29f2eb9-c698-47cb-aea5-56350dd07581 (ip-172-16-10-30.ec2.internal) started on 172.16.10.30:47245 I1208 02:27:51.542243 24578 master.cpp:458] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/i4FLJ1/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --root_submissions="true" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/i4FLJ1/master" --zk_session_timeout="10secs" I1208 02:27:51.542359 
24578 master.cpp:507] Master only allowing authenticated frameworks to register I1208 02:27:51.542366 24578 master.cpp:513] Master only allowing authenticated agents to register I1208 02:27:51.542371 24578 master.cpp:519] Master only allowing authenticated HTTP frameworks to register I1208 02:27:51.542376 24578 credentials.hpp:37] Loading credentials for authentication from '/tmp/i4FLJ1/credentials' I1208 02:27:51.542466 24578 master.cpp:563] Using default 'crammd5' authenticator I1208 02:27:51.542503 24578 http.cpp:1045] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I1208 02:27:51.542539 24578 http.cpp:1045] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I1208 02:27:51.542564 24578 http.cpp:1045] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I1208 02:27:51.542593 24578 master.cpp:642] Authorization enabled I1208 02:27:51.542634 24577 hierarchical.cpp:175] Initialized hierarchical allocator process I1208 02:27:51.542667 24577 whitelist_watcher.cpp:77] No whitelist given I1208 02:27:51.543349 24571 master.cpp:2214] Elected as the leading master! 
I1208 02:27:51.543365 24571 master.cpp:1694] Recovering from registrar I1208 02:27:51.543426 24576 registrar.cpp:347] Recovering registrar I1208 02:27:51.543519 24576 registrar.cpp:391] Successfully fetched the registry (0B) in 0ns I1208 02:27:51.543546 24576 registrar.cpp:495] Applied 1 operations in 7697ns; attempting to update the registry I1208 02:27:51.543674 24574 registrar.cpp:552] Successfully updated the registry in 0ns I1208 02:27:51.543707 24574 registrar.cpp:424] Successfully recovered registrar I1208 02:27:51.543820 24571 master.cpp:1807] Recovered 0 agents from the registry (172B); allowing 10mins for agents to re-register I1208 02:27:51.543840 24577 hierarchical.cpp:213] Skipping recovery of hierarchical allocator: nothing to recover W1208 02:27:51.545620 4488 process.cpp:2756] Attempted to spawn already running process files@172.16.10.30:47245 I1208 02:27:51.545984 4488 containerizer.cpp:304] Using isolation { environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni } I1208 02:27:51.549041 4488 linux_launcher.cpp:146] Using /cgroup/freezer as the freezer hierarchy for the Linux launcher I1208 02:27:51.549407 4488 provisioner.cpp:299] Using default backend 'copy' I1208 02:27:51.549849 4488 cluster.cpp:460] Creating default 'local' authorizer I1208 02:27:51.550534 24574 slave.cpp:258] Mesos agent started on (1222)@172.16.10.30:47245 I1208 02:27:51.550555 24574 slave.cpp:259] Flags at startup: --acls="" --agent_features="capabilities { type: MULTI_ROLE } capabilities { type: HIERARCHICAL_ROLE } capabilities { type: RESERVATION_REFINEMENT } capabilities { type: RESOURCE_PROVIDER } " --appc_s
[jira] [Created] (MESOS-8314) Add authorization to the `GET_RESOURCE_PROVIDER` v1 API call.
Jan Schlicht created MESOS-8314: --- Summary: Add authorization to the `GET_RESOURCE_PROVIDER` v1 API call. Key: MESOS-8314 URL: https://issues.apache.org/jira/browse/MESOS-8314 Project: Mesos Issue Type: Task Components: HTTP API Reporter: Jan Schlicht The {{GET_RESOURCE_PROVIDERS}} call is used to list all resource providers known to a Mesos master or agent. This call needs to be authorized.
[jira] [Created] (MESOS-8309) Introduce a UUID message type
Jan Schlicht created MESOS-8309: --- Summary: Introduce a UUID message type Key: MESOS-8309 URL: https://issues.apache.org/jira/browse/MESOS-8309 Project: Mesos Issue Type: Task Reporter: Jan Schlicht Assignee: Jan Schlicht Fix For: 1.5.0 Currently, when a UUID needs to be part of a protobuf message, we use a byte array field. This has some drawbacks, especially when it comes to outputting the UUID in logs: to stringify the UUID field, we first have to create a stout UUID and then call its {{.toString()}} method. It would help to have a UUID type in {{mesos.proto}} and provide a stringification function for it in {{type_utils.hpp}}.
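The logging drawback described above — a raw bytes field must first be parsed into a UUID object before it can be printed — looks roughly like this. A minimal illustration using Python's standard {{uuid}} module; {{stringify_uuid}} is a made-up helper name, and Mesos itself does this in C++ via stout's UUID.

```python
# Illustration of the stringification step the ticket describes: a raw
# 16-byte protobuf field has no readable form until it is parsed into a
# UUID object and converted to its canonical dashed-hex string.
import uuid

def stringify_uuid(raw: bytes) -> str:
    """Turn a 16-byte UUID field into its canonical string form."""
    return str(uuid.UUID(bytes=raw))
```

A dedicated UUID message type plus a shared stringification helper would let logging sites skip this parse-then-format dance.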
[jira] [Created] (MESOS-8289) ReservationTest.MasterFailover is flaky when run with `RESOURCE_PROVIDER` capability
Jan Schlicht created MESOS-8289: --- Summary: ReservationTest.MasterFailover is flaky when run with `RESOURCE_PROVIDER` capability Key: MESOS-8289 URL: https://issues.apache.org/jira/browse/MESOS-8289 Project: Mesos Issue Type: Bug Components: test Reporter: Jan Schlicht Assignee: Jan Schlicht Fix For: 1.5.0 On a system under load, {{ResourceProviderCapability/ReservationTest.MasterFailover/1}} can fail. {{GLOG_v=2}} of the failure: {noformat} [ RUN ] ResourceProviderCapability/ReservationTest.MasterFailover/1 I1201 14:52:47.324741 122806272 process.cpp:2730] Dropping event for process hierarchical-allocator(34)@172.18.8.37:57116 I1201 14:52:47.324816 122806272 process.cpp:2730] Dropping event for process slave(17)@172.18.8.37:57116 I1201 14:52:47.324859 2720961344 clock.cpp:331] Clock paused at 2017-12-01 13:53:04.834857088+00:00 I1201 14:52:47.326314 2720961344 clock.cpp:435] Clock of files@172.18.8.37:57116 updated to 2017-12-01 13:53:04.834857088+00:00 I1201 14:52:47.326371 2720961344 clock.cpp:435] Clock of hierarchical-allocator(35)@172.18.8.37:57116 updated to 2017-12-01 13:53:04.834857088+00:00 I1201 14:52:47.326539 2720961344 cluster.cpp:170] Creating default 'local' authorizer I1201 14:52:47.326568 2720961344 clock.cpp:435] Clock of local-authorizer(52)@172.18.8.37:57116 updated to 2017-12-01 13:53:04.834857088+00:00 I1201 14:52:47.326671 2720961344 clock.cpp:435] Clock of standalone-master-detector(52)@172.18.8.37:57116 updated to 2017-12-01 13:53:04.834857088+00:00 I1201 14:52:47.326709 2720961344 clock.cpp:435] Clock of in-memory-storage(35)@172.18.8.37:57116 updated to 2017-12-01 13:53:04.834857088+00:00 I1201 14:52:47.326884 2720961344 clock.cpp:435] Clock of registrar(35)@172.18.8.37:57116 updated to 2017-12-01 13:53:04.834857088+00:00 I1201 14:52:47.327579 2720961344 clock.cpp:435] Clock of master@172.18.8.37:57116 updated to 2017-12-01 13:53:04.834857088+00:00 I1201 14:52:47.330301 119050240 master.cpp:454] Master 
209387ca-a9c3-4717-9769-a59d9fe927f1 (172.18.8.37) started on 172.18.8.37:57116 I1201 14:52:47.330329 119050240 master.cpp:456] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="5ms" --allocator="HierarchicalDRF" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authenticators="crammd5" --authorizers="local" --credentials="/private/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/z44iHn/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --roles="role" --root_submissions="true" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/private/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/z44iHn/master" --zk_session_timeout="10secs" I1201 14:52:47.330628 119050240 master.cpp:505] Master only allowing authenticated frameworks to register I1201 14:52:47.330638 119050240 master.cpp:511] Master only allowing authenticated agents to register I1201 14:52:47.330644 119050240 master.cpp:517] Master only allowing authenticated HTTP frameworks to register I1201 14:52:47.330652 119050240 credentials.hpp:37] Loading credentials for authentication from 
'/private/var/folders/0b/srgwj7vd2037pygpz1fpyqgmgn/T/z44iHn/credentials' I1201 14:52:47.330873 119050240 master.cpp:561] Using default 'crammd5' authenticator I1201 14:52:47.330927 119050240 clock.cpp:435] Clock of crammd5-authenticator(35)@172.18.8.37:57116 updated to 2017-12-01 13:53:04.834857088+00:00 I1201 14:52:47.330963 119050240 http.cpp:1045] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I1201 14:52:47.330993 119050240 clock.cpp:435] Clock of __basic_authenticator__(137)@172.18.8.37:57116 updated to 2017-12-01 13:53:04.834857088+00:00 I1201 14:52:47.331056 119050240 http.cpp:1045] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I1201 14:52:47.331082 119050240 clock.cpp:435] Clock of __basic_authenticator__(138)@172.18.8.37:57116 updated to 2017-12-01 13:53:04.834857088+00:00 I1201 14
[jira] [Created] (MESOS-8270) Add an agent endpoint to list all active resource providers
Jan Schlicht created MESOS-8270: --- Summary: Add an agent endpoint to list all active resource providers Key: MESOS-8270 URL: https://issues.apache.org/jira/browse/MESOS-8270 Project: Mesos Issue Type: Task Components: agent Reporter: Jan Schlicht Assignee: Jan Schlicht Operators/Frameworks might need information about all resource providers currently running on an agent. An API endpoint should provide that information and include resource provider name and type.
[jira] [Created] (MESOS-8269) Support resource provider re-subscription in the resource provider manager
Jan Schlicht created MESOS-8269: --- Summary: Support resource provider re-subscription in the resource provider manager Key: MESOS-8269 URL: https://issues.apache.org/jira/browse/MESOS-8269 Project: Mesos Issue Type: Task Reporter: Jan Schlicht Assignee: Jan Schlicht Resource providers may re-subscribe by sending a {{SUBSCRIBE}} call that includes a resource provider ID. Support for this has to be added to the resource provider manager. E.g., the manager should check if a resource provider with the ID exists and use the updated HTTP connection.
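The re-subscription handling described above can be sketched as follows. This is a hedged illustration, not the actual resource provider manager code; {{handle_subscribe}} and the dict-based bookkeeping are assumptions made for the sketch.

```python
# Illustrative sketch (not Mesos source): a SUBSCRIBE call carrying a
# known provider ID is a re-subscription, so the manager swaps in the
# new HTTP connection; a call without an ID registers a new provider.
def handle_subscribe(providers, provider_id, connection, new_id):
    """providers maps resource provider ID -> HTTP connection."""
    if provider_id is not None:
        if provider_id not in providers:
            raise ValueError("unknown resource provider: %s" % provider_id)
        providers[provider_id] = connection  # re-subscription path
        return provider_id
    pid = new_id()  # first subscription: assign a fresh ID
    providers[pid] = connection
    return pid
```

Keeping the old connection around after the swap is exactly the hazard MESOS-8346 above describes, so the manager must also make sure the replaced connection's teardown no longer affects the re-subscribed provider.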
[jira] [Updated] (MESOS-8263) ResourceProviderManagerHttpApiTest.ConvertResources is flaky
[ https://issues.apache.org/jira/browse/MESOS-8263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht updated MESOS-8263: Sprint: Mesosphere Sprint 68 Story Points: 2 Labels: mesosphere test (was: test) > ResourceProviderManagerHttpApiTest.ConvertResources is flaky > > > Key: MESOS-8263 > URL: https://issues.apache.org/jira/browse/MESOS-8263 > Project: Mesos > Issue Type: Bug > Components: flaky >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: mesosphere, test > > From an ASF CI run: > {noformat} > 3: [ OK ] > ContentType/ResourceProviderManagerHttpApiTest.ConvertResources/0 (1048 ms) > 3: [ RUN ] > ContentType/ResourceProviderManagerHttpApiTest.ConvertResources/1 > 3: I1123 08:06:04.233137 20036 cluster.cpp:162] Creating default 'local' > authorizer > 3: I1123 08:06:04.237293 20060 master.cpp:448] Master > 7c9d8e8c-3fb3-44c5-8505-488ada3e848e (dce3e4c418cb) started on > 172.17.0.2:35090 > 3: I1123 08:06:04.237325 20060 master.cpp:450] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/EpiTO7/credentials" > --filter_gpu_resources="true" --framework_sorter="drf" --help="false" > --hostname_lookup="true" --http_authenticators="basic" > --http_framework_authenticators="basic" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_agent_ping_timeouts="5" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" 
--registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/EpiTO7/master" > --zk_session_timeout="10secs" > 3: I1123 08:06:04.237727 20060 master.cpp:499] Master only allowing > authenticated frameworks to register > 3: I1123 08:06:04.237743 20060 master.cpp:505] Master only allowing > authenticated agents to register > 3: I1123 08:06:04.237753 20060 master.cpp:511] Master only allowing > authenticated HTTP frameworks to register > 3: I1123 08:06:04.237764 20060 credentials.hpp:37] Loading credentials for > authentication from '/tmp/EpiTO7/credentials' > 3: I1123 08:06:04.238149 20060 master.cpp:555] Using default 'crammd5' > authenticator > 3: I1123 08:06:04.238358 20060 http.cpp:1045] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > 3: I1123 08:06:04.238575 20060 http.cpp:1045] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > 3: I1123 08:06:04.238764 20060 http.cpp:1045] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > 3: I1123 08:06:04.238939 20060 master.cpp:634] Authorization enabled > 3: I1123 08:06:04.239159 20043 whitelist_watcher.cpp:77] No whitelist given > 3: I1123 08:06:04.239187 20045 hierarchical.cpp:173] Initialized hierarchical > allocator process > 3: I1123 08:06:04.242822 20041 master.cpp:2215] Elected as the leading master! 
> 3: I1123 08:06:04.242857 20041 master.cpp:1695] Recovering from registrar > 3: I1123 08:06:04.243067 20052 registrar.cpp:347] Recovering registrar > 3: I1123 08:06:04.243808 20052 registrar.cpp:391] Successfully fetched the > registry (0B) in 690944ns > 3: I1123 08:06:04.243953 20052 registrar.cpp:495] Applied 1 operations in > 37370ns; attempting to update the registry > 3: I1123 08:06:04.244638 20052 registrar.cpp:552] Successfully updated the > registry in 620032ns > 3: I1123 08:06:04.244798 20052 registrar.cpp:424] Successfully recovered > registrar > 3: I1123 08:06:04.245352 20058 hierarchical.cpp:211] Skipping recovery of > hierarchical allocator: nothing to recover > 3: I1123 08:06:04.245358 20057 master.cpp:1808] Recovered 0 agents from the > registry (129B); allowing 10mins for agents to re-register > 3: W1123 08:06:04.251852 20036 process.cpp:2756] Attempted to spawn already > running process files@172.17.0.2:35090 > 3: I1123 08:06:04.253250 20036 containerizer.cpp:301] Using isolation { > environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni } > 3: W1123 08:06:04.253965 20036 backend.cpp:76] F
[jira] [Created] (MESOS-8263) ResourceProviderManagerHttpApiTest.ConvertResources is flaky
Jan Schlicht created MESOS-8263: --- Summary: ResourceProviderManagerHttpApiTest.ConvertResources is flaky Key: MESOS-8263 URL: https://issues.apache.org/jira/browse/MESOS-8263 Project: Mesos Issue Type: Bug Components: flaky Reporter: Jan Schlicht Assignee: Jan Schlicht >From a ASF CI run: {noformat} 3: [ OK ] ContentType/ResourceProviderManagerHttpApiTest.ConvertResources/0 (1048 ms) 3: [ RUN ] ContentType/ResourceProviderManagerHttpApiTest.ConvertResources/1 3: I1123 08:06:04.233137 20036 cluster.cpp:162] Creating default 'local' authorizer 3: I1123 08:06:04.237293 20060 master.cpp:448] Master 7c9d8e8c-3fb3-44c5-8505-488ada3e848e (dce3e4c418cb) started on 172.17.0.2:35090 3: I1123 08:06:04.237325 20060 master.cpp:450] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/EpiTO7/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --root_submissions="true" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/EpiTO7/master" 
--zk_session_timeout="10secs" 3: I1123 08:06:04.237727 20060 master.cpp:499] Master only allowing authenticated frameworks to register 3: I1123 08:06:04.237743 20060 master.cpp:505] Master only allowing authenticated agents to register 3: I1123 08:06:04.237753 20060 master.cpp:511] Master only allowing authenticated HTTP frameworks to register 3: I1123 08:06:04.237764 20060 credentials.hpp:37] Loading credentials for authentication from '/tmp/EpiTO7/credentials' 3: I1123 08:06:04.238149 20060 master.cpp:555] Using default 'crammd5' authenticator 3: I1123 08:06:04.238358 20060 http.cpp:1045] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' 3: I1123 08:06:04.238575 20060 http.cpp:1045] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' 3: I1123 08:06:04.238764 20060 http.cpp:1045] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' 3: I1123 08:06:04.238939 20060 master.cpp:634] Authorization enabled 3: I1123 08:06:04.239159 20043 whitelist_watcher.cpp:77] No whitelist given 3: I1123 08:06:04.239187 20045 hierarchical.cpp:173] Initialized hierarchical allocator process 3: I1123 08:06:04.242822 20041 master.cpp:2215] Elected as the leading master! 
3: I1123 08:06:04.242857 20041 master.cpp:1695] Recovering from registrar 3: I1123 08:06:04.243067 20052 registrar.cpp:347] Recovering registrar 3: I1123 08:06:04.243808 20052 registrar.cpp:391] Successfully fetched the registry (0B) in 690944ns 3: I1123 08:06:04.243953 20052 registrar.cpp:495] Applied 1 operations in 37370ns; attempting to update the registry 3: I1123 08:06:04.244638 20052 registrar.cpp:552] Successfully updated the registry in 620032ns 3: I1123 08:06:04.244798 20052 registrar.cpp:424] Successfully recovered registrar 3: I1123 08:06:04.245352 20058 hierarchical.cpp:211] Skipping recovery of hierarchical allocator: nothing to recover 3: I1123 08:06:04.245358 20057 master.cpp:1808] Recovered 0 agents from the registry (129B); allowing 10mins for agents to re-register 3: W1123 08:06:04.251852 20036 process.cpp:2756] Attempted to spawn already running process files@172.17.0.2:35090 3: I1123 08:06:04.253250 20036 containerizer.cpp:301] Using isolation { environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni } 3: W1123 08:06:04.253965 20036 backend.cpp:76] Failed to create 'aufs' backend: AufsBackend requires root privileges 3: W1123 08:06:04.254109 20036 backend.cpp:76] Failed to create 'bind' backend: BindBackend requires root privileges 3: I1123 08:06:04.254148 20036 provisioner.cpp:259] Using default backend 'copy' 3: I1123 08:06:04.256542 20036 cluster.cpp:448] Creating default 'local' authorizer 3: I1123 08:06:04.260066 20057 slave.cpp:262] Mesos agent started on (784)@172.17.0.2:35090 3: I1123 08:06:04.
[jira] [Comment Edited] (MESOS-8211) Handle agent local resources in offer operation handler
[ https://issues.apache.org/jira/browse/MESOS-8211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249372#comment-16249372 ] Jan Schlicht edited comment on MESOS-8211 at 11/14/17 2:14 PM: --- https://reviews.apache.org/r/63751/ https://reviews.apache.org/r/63797/ was (Author: nfnt): https://reviews.apache.org/r/63751/ > Handle agent local resources in offer operation handler > --- > > Key: MESOS-8211 > URL: https://issues.apache.org/jira/browse/MESOS-8211 > Project: Mesos > Issue Type: Task > Components: agent >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: mesosphere > > The master will send {{ApplyOfferOperationMessage}} instead of > {{CheckpointResourcesMessage}} when an agent has the 'RESOURCE_PROVIDER' > capability set. The agent handler for the message needs to be updated to > support operations on agent resources. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8218) Support `RESERVE`/`CREATE` operations with resource providers
[ https://issues.apache.org/jira/browse/MESOS-8218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht updated MESOS-8218: Shepherd: Jie Yu > Support `RESERVE`/`CREATE` operations with resource providers > - > > Key: MESOS-8218 > URL: https://issues.apache.org/jira/browse/MESOS-8218 > Project: Mesos > Issue Type: Task >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: mesosphere > > {{RESERVE}}/{{UNRESERVE}}/{{CREATE}}/{{DESTROY}} operations should work with > resource provider resources like they do with agent resources. I.e. they will > be speculatively applied and an offer operation will be sent to the > respective resource provider. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (MESOS-8218) Support `RESERVE`/`CREATE` operations with resource providers
[ https://issues.apache.org/jira/browse/MESOS-8218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht reassigned MESOS-8218: --- Assignee: Jan Schlicht > Support `RESERVE`/`CREATE` operations with resource providers > - > > Key: MESOS-8218 > URL: https://issues.apache.org/jira/browse/MESOS-8218 > Project: Mesos > Issue Type: Task >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: mesosphere > > {{RESERVE}}/{{UNRESERVE}}/{{CREATE}}/{{DESTROY}} operations should work with > resource provider resources like they do with agent resources. I.e. they will > be speculatively applied and an offer operation will be sent to the > respective resource provider. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8218) Support `RESERVE`/`CREATE` operations with resource providers
Jan Schlicht created MESOS-8218: --- Summary: Support `RESERVE`/`CREATE` operations with resource providers Key: MESOS-8218 URL: https://issues.apache.org/jira/browse/MESOS-8218 Project: Mesos Issue Type: Task Reporter: Jan Schlicht {{RESERVE}}/{{UNRESERVE}}/{{CREATE}}/{{DESTROY}} operations should work with resource provider resources like they do with agent resources. I.e. they will be speculatively applied and an offer operation will be sent to the respective resource provider. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
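The "speculatively applied" behaviour described in this ticket can be sketched in a few lines. This is an illustrative model only: the class name {{SpeculativeState}} and the use of plain doubles instead of Mesos {{Resources}} objects are invented for the example. The point it shows is that local bookkeeping is updated immediately, while the corresponding offer operation is queued for the resource provider to apply later.

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative sketch only: a toy model of "speculative apply". Real Mesos
// tracks full Resource objects; here reserved amounts are plain doubles and
// all names are invented for this example.
class SpeculativeState {
public:
  // Apply a RESERVE speculatively: update local bookkeeping immediately
  // and queue an offer operation for the resource provider.
  void reserve(const std::string& role, double cpus) {
    reserved_[role] += cpus;
    pending_.push_back("RESERVE " + role);
  }

  void unreserve(const std::string& role, double cpus) {
    reserved_[role] -= cpus;
    pending_.push_back("UNRESERVE " + role);
  }

  double reservedFor(const std::string& role) const {
    auto it = reserved_.find(role);
    return it == reserved_.end() ? 0.0 : it->second;
  }

  // Operations still awaiting application by the resource provider.
  const std::vector<std::string>& pending() const { return pending_; }

private:
  std::map<std::string, double> reserved_;
  std::vector<std::string> pending_;
};
```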
[jira] [Created] (MESOS-8211) Handle agent local resources in offer operation handler
Jan Schlicht created MESOS-8211: --- Summary: Handle agent local resources in offer operation handler Key: MESOS-8211 URL: https://issues.apache.org/jira/browse/MESOS-8211 Project: Mesos Issue Type: Task Components: agent Reporter: Jan Schlicht Assignee: Jan Schlicht The master will send {{ApplyOfferOperationMessage}} instead of {{CheckpointResourcesMessage}} when an agent has the 'RESOURCE_PROVIDER' capability set. The agent handler for the message needs to be updated to support operations on agent resources. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
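The capability-based dispatch this ticket describes can be illustrated with a small sketch. {{AgentInfo}} and {{selectCheckpointMessage}} are invented names, not actual Mesos types; only the two message names and the 'RESOURCE_PROVIDER' capability come from the ticket.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical sketch: the real logic lives in the Mesos master. The names
// AgentInfo and selectCheckpointMessage are illustrative, not Mesos APIs.
struct AgentInfo {
  std::vector<std::string> capabilities;
};

// Pick which message the master would send when applying an operation,
// based on whether the agent advertises the RESOURCE_PROVIDER capability.
inline std::string selectCheckpointMessage(const AgentInfo& agent) {
  const bool hasResourceProvider = std::find(
      agent.capabilities.begin(),
      agent.capabilities.end(),
      "RESOURCE_PROVIDER") != agent.capabilities.end();

  return hasResourceProvider
    ? "ApplyOfferOperationMessage"
    : "CheckpointResourcesMessage";
}
```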
[jira] [Comment Edited] (MESOS-7594) Implement 'apply' for resource provider related operations
[ https://issues.apache.org/jira/browse/MESOS-7594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150341#comment-16150341 ] Jan Schlicht edited comment on MESOS-7594 at 10/18/17 2:35 PM: --- https://reviews.apache.org/r/63104/ https://reviews.apache.org/r/61810/ https://reviews.apache.org/r/61946/ https://reviews.apache.org/r/63105/ https://reviews.apache.org/r/61947/ was (Author: nfnt): https://reviews.apache.org/r/61810/ https://reviews.apache.org/r/61946/ https://reviews.apache.org/r/61947/ > Implement 'apply' for resource provider related operations > -- > > Key: MESOS-7594 > URL: https://issues.apache.org/jira/browse/MESOS-7594 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: mesosphere, storage > > Resource providers provide new offer operations ({{CREATE_BLOCK}}, > {{DESTROY_BLOCK}}, {{CREATE_VOLUME}}, {{DESTROY_VOLUME}}). These operations > can be applied by frameworks when they accept an offer. Handling of these > operations has to be added to the master's {{accept}} call. I.e. the > corresponding resource provider needs to be extracted from the offer's resources > and a {{resource_provider::Event::OPERATION}} has to be sent to the resource > provider. The resource provider will answer with a > {{resource_provider::Call::Update}} which needs to be handled as well. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
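The routing step described in this ticket (extract the owning resource provider from the offer's resources, then send it an {{OPERATION}} event) can be sketched with stand-in types. {{ToyResource}}, {{ToyOperation}}, and {{routeOperations}} are invented for illustration and bear no relation to actual Mesos classes.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy routing sketch for the `accept` handling described above: each
// resource carries its provider's ID, so an operation such as CREATE_VOLUME
// can be routed to that provider as an OPERATION event. All types here are
// stand-ins invented for this example.
struct ToyResource {
  std::string providerId;  // empty for plain agent resources
};

struct ToyOperation {
  std::string type;        // e.g. "CREATE_VOLUME"
  ToyResource resource;
};

// Returns a map from resource provider ID to the operation types it
// would receive as OPERATION events.
std::map<std::string, std::vector<std::string>> routeOperations(
    const std::vector<ToyOperation>& operations) {
  std::map<std::string, std::vector<std::string>> events;
  for (const ToyOperation& op : operations) {
    if (!op.resource.providerId.empty()) {
      events[op.resource.providerId].push_back(op.type);
    }
  }
  return events;
}
```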
[jira] [Updated] (MESOS-8087) Add operation status update handler in Master.
[ https://issues.apache.org/jira/browse/MESOS-8087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht updated MESOS-8087: Sprint: Mesosphere Sprint 65 Story Points: 5 Labels: mesosphere (was: ) > Add operation status update handler in Master. > -- > > Key: MESOS-8087 > URL: https://issues.apache.org/jira/browse/MESOS-8087 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Jan Schlicht > Labels: mesosphere > > Please follow this doc for details. > https://docs.google.com/document/d/1RrrLVATZUyaURpEOeGjgxA6ccshuLo94G678IbL-Yco/edit# > This handler will process operation status updates from resource providers. > Depending on whether they are old or new operations, the logic is slightly > different. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8089) Add messages to publish resources on a resource provider
[ https://issues.apache.org/jira/browse/MESOS-8089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht updated MESOS-8089: Sprint: Mesosphere Sprint 65 (was: Mesosphere Sprint 66) > Add messages to publish resources on a resource provider > > > Key: MESOS-8089 > URL: https://issues.apache.org/jira/browse/MESOS-8089 > Project: Mesos > Issue Type: Task >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: mesosphere > > Before launching a task that uses resource provider resources, the resource > provider needs to be informed to "publish" these resources as it may take > some necessary actions. For external resource providers, resources might also > have to be "unpublished" when a task is finished. The resource provider needs > to ack these calls once it's ready. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (MESOS-8087) Add operation status update handler in Master.
[ https://issues.apache.org/jira/browse/MESOS-8087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht reassigned MESOS-8087: --- Assignee: Jan Schlicht > Add operation status update handler in Master. > -- > > Key: MESOS-8087 > URL: https://issues.apache.org/jira/browse/MESOS-8087 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Jan Schlicht > > Please follow this doc for details. > https://docs.google.com/document/d/1RrrLVATZUyaURpEOeGjgxA6ccshuLo94G678IbL-Yco/edit# > This handler will process operation status updates from resource providers. > Depending on whether they are old or new operations, the logic is slightly > different. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8089) Add messages to publish resources on a resource provider
[ https://issues.apache.org/jira/browse/MESOS-8089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht updated MESOS-8089: Sprint: Mesosphere Sprint 66 Story Points: 7 > Add messages to publish resources on a resource provider > > > Key: MESOS-8089 > URL: https://issues.apache.org/jira/browse/MESOS-8089 > Project: Mesos > Issue Type: Task >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: mesosphere > > Before launching a task that uses resource provider resources, the resource > provider needs to be informed to "publish" these resources as it may take > some necessary actions. For external resource providers, resources might also > have to be "unpublished" when a task is finished. The resource provider needs > to ack these calls once it's ready. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8089) Add messages to publish resources on a resource provider
Jan Schlicht created MESOS-8089: --- Summary: Add messages to publish resources on a resource provider Key: MESOS-8089 URL: https://issues.apache.org/jira/browse/MESOS-8089 Project: Mesos Issue Type: Task Reporter: Jan Schlicht Assignee: Jan Schlicht Before launching a task that uses resource provider resources, the resource provider needs to be informed to "publish" these resources as it may take some necessary actions. For external resource providers, resources might also have to be "unpublished" when a task is finished. The resource provider needs to ack these calls once it's ready. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
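The publish/ack handshake described above amounts to a small state machine. All names here ({{PublishTracker}}, the {{PublishState}} values) are invented for the sketch; the ticket only specifies that the provider must acknowledge before the resources are usable.

```cpp
#include <map>
#include <string>

// Illustrative state machine for the publish/ack handshake. The state and
// class names are invented for this sketch; the real protocol is defined by
// Mesos' resource provider messages.
enum class PublishState { UNPUBLISHED, PUBLISHING, PUBLISHED };

class PublishTracker {
public:
  // The agent asks the provider to publish resources before a task launch.
  void requestPublish(const std::string& providerId) {
    states_[providerId] = PublishState::PUBLISHING;
  }

  // The provider acknowledges once the resources are ready for use.
  void acknowledge(const std::string& providerId) {
    if (states_[providerId] == PublishState::PUBLISHING) {
      states_[providerId] = PublishState::PUBLISHED;
    }
  }

  // A task using the provider's resources may launch only after the ack.
  bool readyToLaunch(const std::string& providerId) {
    return states_[providerId] == PublishState::PUBLISHED;
  }

private:
  std::map<std::string, PublishState> states_;
};
```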
[jira] [Commented] (MESOS-7995) libprocess tests breaking on macOS.
[ https://issues.apache.org/jira/browse/MESOS-7995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174656#comment-16174656 ] Jan Schlicht commented on MESOS-7995: - Forgot to mention it: Mine's also a SSL build (--enable-libevent --enable-ssl), using libevent 2.0.22. Latest HEAD (c0293a6f7d457a595a3763662e3a9740db31859b). > libprocess tests breaking on macOS. > --- > > Key: MESOS-7995 > URL: https://issues.apache.org/jira/browse/MESOS-7995 > Project: Mesos > Issue Type: Bug > Components: libprocess, test >Affects Versions: 1.5.0 >Reporter: Till Toenshoff >Priority: Blocker > > Many libprocess tests fail on macOS, some even abort. > Examples: > {noformat} > [--] 8 tests from HTTPConnectionTest > [ RUN ] HTTPConnectionTest.GzipRequestBody > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:972: Failure > Failed to wait 15secs for connect > [ FAILED ] HTTPConnectionTest.GzipRequestBody (15001 ms) > [ RUN ] HTTPConnectionTest.Serial > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1015: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.Serial (0 ms) > [ RUN ] HTTPConnectionTest.Pipeline > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1094: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.Pipeline (1 ms) > [ RUN ] HTTPConnectionTest.ClosingRequest > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1190: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.ClosingRequest (0 ms) > [ RUN ] HTTPConnectionTest.ClosingResponse > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1245: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.ClosingResponse (0 ms) > [ RUN ] HTTPConnectionTest.ReferenceCounting > 
../../../3rdparty/libprocess/src/tests/http_tests.cpp:1306: Failure > (*connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.ReferenceCounting (1 ms) > [ RUN ] HTTPConnectionTest.Equality > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1333: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.Equality (0 ms) > [ RUN ] HTTPConnectionTest.RequestStreaming > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1360: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.RequestStreaming (0 ms) > [--] 8 tests from HTTPConnectionTest (15003 ms total) > {noformat} > {noformat} > [--] 8 tests from HttpAuthenticationTest > [ RUN ] HttpAuthenticationTest.NoAuthenticator > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1792: Failure > (response).failure(): Failed to connect to 192.168.178.20:51437: Host is down > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1786: Failure > Actual function call count doesn't match EXPECT_CALL(*http.process, > authenticated(_, Option::none()))... 
> Expected: to be called once >Actual: never called - unsatisfied and active > [ FAILED ] HttpAuthenticationTest.NoAuthenticator (1 ms) > [ RUN ] HttpAuthenticationTest.Unauthorized > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1816: Failure > (response).failure(): Failed to connect to 192.168.178.20:51437: Host is down > WARNING: Logging before InitGoogleLogging() is written to STDERR > F0921 12:18:19.947710 2519827264 future.hpp:1151] Check failed: !isFailed() > Future::get() but state == FAILED: Failed to connect to 192.168.178.20:51437: > Host is down > *** Check failure stack trace: *** > *** Aborted at 1505989099 (unix time) try "date -d @1505989099" if you are > using GNU date *** > PC: @ 0x7fff5cd45fce __pthread_kill > *** SIGABRT (@0x7fff5cd45fce) received by PID 23916 (TID 0x7fff96318340) > stack trace: *** > @ 0x7fff5ce76f5a _sigtramp > @ 0x7fff5ac5e526 std::__1::locale::facet::__on_zero_shared() > @ 0x7fff5cca232a abort > @0x1077b9659 google::logging_fail() > @0x1077b964a google::LogMessage::Fail() > @0x1077b72fc google::LogMessage::SendToLog() > @0x1077b8089 google::LogMessage::Flush() > @0x1077c12e9 google::LogMessageFatal::~LogMessageFatal() > @0x1077b9b35 google::LogMessageFatal::~LogMessageFatal() > @0x106998ad1 process::Future<>::get() > @0x1069d4d5b HttpAuthenticationTest_Unauthorized_Test::TestBody() > @0x1070a828e > testing::internal::HandleSehExceptionsInMethodIf
[jira] [Commented] (MESOS-7995) libprocess tests breaking on macOS.
[ https://issues.apache.org/jira/browse/MESOS-7995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174636#comment-16174636 ] Jan Schlicht commented on MESOS-7995: - Is there something specific different in your environment? Can't reproduce this on macOS 10.13, Apple Clang 9.0.0. All libprocess tests are successful. > libprocess tests breaking on macOS. > --- > > Key: MESOS-7995 > URL: https://issues.apache.org/jira/browse/MESOS-7995 > Project: Mesos > Issue Type: Bug > Components: libprocess, test >Affects Versions: 1.5.0 >Reporter: Till Toenshoff >Priority: Blocker > > Many libprocess tests fail on macOS, some even abort. > Examples: > {noformat} > [--] 8 tests from HTTPConnectionTest > [ RUN ] HTTPConnectionTest.GzipRequestBody > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:972: Failure > Failed to wait 15secs for connect > [ FAILED ] HTTPConnectionTest.GzipRequestBody (15001 ms) > [ RUN ] HTTPConnectionTest.Serial > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1015: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.Serial (0 ms) > [ RUN ] HTTPConnectionTest.Pipeline > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1094: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.Pipeline (1 ms) > [ RUN ] HTTPConnectionTest.ClosingRequest > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1190: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.ClosingRequest (0 ms) > [ RUN ] HTTPConnectionTest.ClosingResponse > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1245: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.ClosingResponse (0 ms) > [ RUN ] HTTPConnectionTest.ReferenceCounting > 
../../../3rdparty/libprocess/src/tests/http_tests.cpp:1306: Failure > (*connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.ReferenceCounting (1 ms) > [ RUN ] HTTPConnectionTest.Equality > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1333: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.Equality (0 ms) > [ RUN ] HTTPConnectionTest.RequestStreaming > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1360: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.RequestStreaming (0 ms) > [--] 8 tests from HTTPConnectionTest (15003 ms total) > {noformat} > {noformat} > [--] 8 tests from HttpAuthenticationTest > [ RUN ] HttpAuthenticationTest.NoAuthenticator > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1792: Failure > (response).failure(): Failed to connect to 192.168.178.20:51437: Host is down > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1786: Failure > Actual function call count doesn't match EXPECT_CALL(*http.process, > authenticated(_, Option::none()))... 
> Expected: to be called once >Actual: never called - unsatisfied and active > [ FAILED ] HttpAuthenticationTest.NoAuthenticator (1 ms) > [ RUN ] HttpAuthenticationTest.Unauthorized > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1816: Failure > (response).failure(): Failed to connect to 192.168.178.20:51437: Host is down > WARNING: Logging before InitGoogleLogging() is written to STDERR > F0921 12:18:19.947710 2519827264 future.hpp:1151] Check failed: !isFailed() > Future::get() but state == FAILED: Failed to connect to 192.168.178.20:51437: > Host is down > *** Check failure stack trace: *** > *** Aborted at 1505989099 (unix time) try "date -d @1505989099" if you are > using GNU date *** > PC: @ 0x7fff5cd45fce __pthread_kill > *** SIGABRT (@0x7fff5cd45fce) received by PID 23916 (TID 0x7fff96318340) > stack trace: *** > @ 0x7fff5ce76f5a _sigtramp > @ 0x7fff5ac5e526 std::__1::locale::facet::__on_zero_shared() > @ 0x7fff5cca232a abort > @0x1077b9659 google::logging_fail() > @0x1077b964a google::LogMessage::Fail() > @0x1077b72fc google::LogMessage::SendToLog() > @0x1077b8089 google::LogMessage::Flush() > @0x1077c12e9 google::LogMessageFatal::~LogMessageFatal() > @0x1077b9b35 google::LogMessageFatal::~LogMessageFatal() > @0x106998ad1 process::Future<>::get() > @0x1069d4d5b HttpAuthenticationTest_Unauthorized_Test::TestBody() > @0x1070a828e > testing::internal::HandleSehExceptionsInMethodIfSupport
[jira] [Updated] (MESOS-7594) Implement 'apply' for resource provider related operations
[ https://issues.apache.org/jira/browse/MESOS-7594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht updated MESOS-7594: Story Points: 5 > Implement 'apply' for resource provider related operations > -- > > Key: MESOS-7594 > URL: https://issues.apache.org/jira/browse/MESOS-7594 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: mesosphere, storage > > Resource providers provide new offer operations ({{CREATE_BLOCK}}, > {{DESTROY_BLOCK}}, {{CREATE_VOLUME}}, {{DESTROY_VOLUME}}). These operations > can be applied by frameworks when they accept an offer. Handling of these > operations has to be added to the master's {{accept}} call. I.e. the > corresponding resource provider needs to be extracted from the offer's resources > and a {{resource_provider::Event::OPERATION}} has to be sent to the resource > provider. The resource provider will answer with a > {{resource_provider::Call::Update}} which needs to be handled as well. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7594) Implement 'apply' for resource provider related operations
[ https://issues.apache.org/jira/browse/MESOS-7594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht updated MESOS-7594: Sprint: Mesosphere Sprint 57, Mesosphere Sprint 62 (was: Mesosphere Sprint 57) > Implement 'apply' for resource provider related operations > -- > > Key: MESOS-7594 > URL: https://issues.apache.org/jira/browse/MESOS-7594 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: mesosphere, storage > > Resource providers provide new offer operations ({{CREATE_BLOCK}}, > {{DESTROY_BLOCK}}, {{CREATE_VOLUME}}, {{DESTROY_VOLUME}}). These operations > can be applied by frameworks when they accept an offer. Handling of these > operations has to be added to the master's {{accept}} call. I.e. the > corresponding resource provider needs to be extracted from the offer's resources > and a {{resource_provider::Event::OPERATION}} has to be sent to the resource > provider. The resource provider will answer with a > {{resource_provider::Call::Update}} which needs to be handled as well. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7469) Add resource provider driver.
[ https://issues.apache.org/jira/browse/MESOS-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht updated MESOS-7469: Sprint: Mesosphere Sprint 56, Mesosphere Sprint 60 (was: Mesosphere Sprint 56) > Add resource provider driver. > - > > Key: MESOS-7469 > URL: https://issues.apache.org/jira/browse/MESOS-7469 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Jan Schlicht > Labels: storage > > Similar to the scheduler/executor driver, the resource provider driver will be used > to connect the resource provider and the resource provider manager (which resides > in either the agent, for local resource providers, or the master, for external resource > providers). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7816) Add HTTP connection handling to the resource provider driver
[ https://issues.apache.org/jira/browse/MESOS-7816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht updated MESOS-7816: Labels: mesosphere storage (was: mesosphere) > Add HTTP connection handling to the resource provider driver > > > Key: MESOS-7816 > URL: https://issues.apache.org/jira/browse/MESOS-7816 > Project: Mesos > Issue Type: Task > Components: storage >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: mesosphere, storage > > The {{resource_provider::Driver}} is responsible for establishing a > connection with an agent/master resource provider API, sending calls to > the API, and receiving events from it. This is done using HTTP and should be > implemented similarly to how it's done for schedulers and executors (see > {{src/executor/executor.cpp, src/scheduler/scheduler.cpp}}). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7816) Add HTTP connection handling to the resource provider driver
Jan Schlicht created MESOS-7816: --- Summary: Add HTTP connection handling to the resource provider driver Key: MESOS-7816 URL: https://issues.apache.org/jira/browse/MESOS-7816 Project: Mesos Issue Type: Task Components: storage Reporter: Jan Schlicht Assignee: Jan Schlicht The {{resource_provider::Driver}} is responsible for establishing a connection with an agent/master resource provider API, sending calls to the API, and receiving events from it. This is done using HTTP and should be implemented similarly to how it's done for schedulers and executors (see {{src/executor/executor.cpp, src/scheduler/scheduler.cpp}}). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7780) Add `SUBSCRIBE` call handling to the resource provider manager
[ https://issues.apache.org/jira/browse/MESOS-7780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht updated MESOS-7780: Story Points: 5 > Add `SUBSCRIBE` call handling to the resource provider manager > -- > > Key: MESOS-7780 > URL: https://issues.apache.org/jira/browse/MESOS-7780 > Project: Mesos > Issue Type: Task >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: storage > > Resource providers will use the HTTP API to subscribe to the > {{ResourceProviderManager}}. Handling these calls needs to be implemented. On > subscription, a unique resource provider ID will be assigned to the resource > provider and a {{SUBSCRIBED}} event will be sent. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7780) Add `SUBSCRIBE` call handling to the resource provider manager
[ https://issues.apache.org/jira/browse/MESOS-7780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht updated MESOS-7780: Sprint: Mesosphere Sprint 59 > Add `SUBSCRIBE` call handling to the resource provider manager > -- > > Key: MESOS-7780 > URL: https://issues.apache.org/jira/browse/MESOS-7780 > Project: Mesos > Issue Type: Task >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: storage > > Resource providers will use the HTTP API to subscribe to the > {{ResourceProviderManager}}. Handling these calls needs to be implemented. On > subscription, a unique resource provider ID will be assigned to the resource > provider and a {{SUBSCRIBED}} event will be sent. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7780) Add `SUBSCRIBE` call handling to the resource provider manager
[ https://issues.apache.org/jira/browse/MESOS-7780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht updated MESOS-7780: Sprint: (was: Mesosphere Sprint 59) > Add `SUBSCRIBE` call handling to the resource provider manager > -- > > Key: MESOS-7780 > URL: https://issues.apache.org/jira/browse/MESOS-7780 > Project: Mesos > Issue Type: Task >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: storage > > Resource providers will use the HTTP API to subscribe to the > {{ResourceProviderManager}}. Handling these calls needs to be implemented. On > subscription, a unique resource provider ID will be assigned to the resource > provider and a {{SUBSCRIBED}} event will be sent. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7529) Realm names defined for tests are used in main Mesos code
[ https://issues.apache.org/jira/browse/MESOS-7529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085355#comment-16085355 ] Jan Schlicht commented on MESOS-7529: - Thanks! Don't know why I didn't see the other definitions. > Realm names defined for tests are used in main Mesos code > - > > Key: MESOS-7529 > URL: https://issues.apache.org/jira/browse/MESOS-7529 > Project: Mesos > Issue Type: Bug >Reporter: Jan Schlicht >Priority: Minor > Labels: tech-debt > > In {{process/gtest.hpp}} the realms {{READONLY_HTTP_AUTHENTICATION_REALM}} > and {{READWRITE_HTTP_AUTHENTICATION_REALM}} are defined. These are then used > in {{master/main.cpp}} and {{slave/main.cpp}}. I'd expect that these would > only be used in tests or these realms should be defined elsewhere. > Also the concept of having these two realms seems specific to Mesos, not > libprocess, hence it would make sense to define them somewhere in Mesos.
[jira] [Created] (MESOS-7780) Add `SUBSCRIBE` call handling to the resource provider manager
Jan Schlicht created MESOS-7780: --- Summary: Add `SUBSCRIBE` call handling to the resource provider manager Key: MESOS-7780 URL: https://issues.apache.org/jira/browse/MESOS-7780 Project: Mesos Issue Type: Task Reporter: Jan Schlicht Assignee: Jan Schlicht Resource providers will use the HTTP API to subscribe to the {{ResourceProviderManager}}. Handling these calls needs to be implemented. On subscription, a unique resource provider ID will be assigned to the resource provider and a {{SUBSCRIBED}} event will be sent.
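As a rough illustration of the subscription flow described in MESOS-7780 (this is not the actual Mesos implementation; the class and type names below are hypothetical stand-ins for the real protobuf messages and the `ResourceProviderManager` actor), the manager's side of a `SUBSCRIBE` call might look like:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the resource_provider::Event::SUBSCRIBED
// protobuf message carrying the newly assigned ID.
struct SubscribedEvent { std::string providerId; };

class ResourceProviderManager {
public:
  // On a SUBSCRIBE call, assign a unique resource provider ID, record
  // the provider as registered, and return the SUBSCRIBED event that
  // would be sent back over the provider's HTTP connection.
  SubscribedEvent subscribe(const std::string& providerName) {
    const std::string id = "rp-" + std::to_string(nextId++);
    providers[id] = providerName;
    return SubscribedEvent{id};
  }

  // Whether a provider with the given ID has subscribed.
  bool isSubscribed(const std::string& id) const {
    return providers.count(id) > 0;
  }

private:
  int nextId = 0;
  std::map<std::string, std::string> providers;  // ID -> provider name
};
```

In the real manager the ID would be a UUID and the event would be streamed over the long-lived HTTP connection; the sketch only shows the bookkeeping step (assign ID, record provider, emit SUBSCRIBED).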
[jira] [Commented] (MESOS-7758) Stout doesn't build standalone.
[ https://issues.apache.org/jira/browse/MESOS-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074393#comment-16074393 ] Jan Schlicht commented on MESOS-7758: - Libprocess is affected as well. {noformat} $ cd build/3rdparty/libprocess $ make ... make[1]: *** No rule to make target `googlemock-build-stamp'. Stop. make: *** [../googletest-release-1.8.0/googlemock-build-stamp] Error 2 {noformat} > Stout doesn't build standalone. > --- > > Key: MESOS-7758 > URL: https://issues.apache.org/jira/browse/MESOS-7758 > Project: Mesos > Issue Type: Bug > Components: build, stout >Reporter: James Peach > > Stout doesn't build in a standalone configuration: > {noformat} > $ cd ~/src/mesos/3rdparty/stout > $ ./bootstrap > $ cd ~/build/stout > $ ~/src/mesos/3rdparty/stout/configure > ... > $ make > ... > make[1]: Leaving directory '/home/vagrant/build/stout/3rdparty' > make[1]: Entering directory '/home/vagrant/build/stout/3rdparty' > make[1]: *** No rule to make target 'googlemock-build-stamp'. Stop. > make[1]: Leaving directory '/home/vagrant/build/stout/3rdparty' > make: *** [Makefile:1902: > 3rdparty/googletest-release-1.8.0/googlemock-build-stamp] Error 2 > {noformat} > Note that the build expects > {{3rdparty/googletest-release-1.8.0/googlemock-build-stamp}}, but > {{googletest}} hasn't been staged yet: > {noformat} > [vagrant@fedora-26 stout]$ ls -l 3rdparty/ > total 44 > drwxr-xr-x. 3 vagrant vagrant 4096 Jan 18 2016 boost-1.53.0 > -rw-rw-r--. 1 vagrant vagrant 0 Jul 5 06:16 boost-1.53.0-stamp > drwxrwxr-x. 8 vagrant vagrant 4096 Aug 15 2016 elfio-3.2 > -rw-rw-r--. 1 vagrant vagrant 0 Jul 5 06:16 elfio-3.2-stamp > drwxr-xr-x. 10 vagrant vagrant 4096 Jul 5 06:16 glog-0.3.3 > -rw-rw-r--. 1 vagrant vagrant 0 Jul 5 06:16 glog-0.3.3-build-stamp > -rw-rw-r--. 1 vagrant vagrant 0 Jul 5 06:16 glog-0.3.3-stamp > -rw-rw-r--. 1 vagrant vagrant 734 Jul 5 06:03 gmock_sources.cc > -rw-rw-r--. 
1 vagrant vagrant 25657 Jul 5 06:03 Makefile > {noformat}
[jira] [Created] (MESOS-7696) Update resource provider design in the master
Jan Schlicht created MESOS-7696: --- Summary: Update resource provider design in the master Key: MESOS-7696 URL: https://issues.apache.org/jira/browse/MESOS-7696 Project: Mesos Issue Type: Task Components: master Reporter: Jan Schlicht Assignee: Jan Schlicht Some discussion around how to use the allocator resulted in changes to how local resource providers and external resource providers should be handled in the master. The current approach needs to be updated.
[jira] [Assigned] (MESOS-7595) Implement local resource provider registration
[ https://issues.apache.org/jira/browse/MESOS-7595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht reassigned MESOS-7595: --- Assignee: Jan Schlicht > Implement local resource provider registration > -- > > Key: MESOS-7595 > URL: https://issues.apache.org/jira/browse/MESOS-7595 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: mesosphere > > A {{resource_provider::Call::SUBSCRIBE}} call of a resource provider should > add it to the list of registered resource providers in the master.
[jira] [Created] (MESOS-7595) Implement local resource provider registration
Jan Schlicht created MESOS-7595: --- Summary: Implement local resource provider registration Key: MESOS-7595 URL: https://issues.apache.org/jira/browse/MESOS-7595 Project: Mesos Issue Type: Task Components: master Reporter: Jan Schlicht A {{resource_provider::Call::SUBSCRIBE}} call of a resource provider should add it to the list of registered resource providers in the master.
[jira] [Updated] (MESOS-7595) Implement local resource provider registration
[ https://issues.apache.org/jira/browse/MESOS-7595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht updated MESOS-7595: Shepherd: Jie Yu
[jira] [Created] (MESOS-7594) Implement 'apply' for resource provider related operations
Jan Schlicht created MESOS-7594: --- Summary: Implement 'apply' for resource provider related operations Key: MESOS-7594 URL: https://issues.apache.org/jira/browse/MESOS-7594 Project: Mesos Issue Type: Task Components: master Reporter: Jan Schlicht Assignee: Jan Schlicht Resource providers provide new offer operations ({{CREATE_BLOCK}}, {{DESTROY_BLOCK}}, {{CREATE_VOLUME}}, {{DESTROY_VOLUME}}). These operations can be applied by frameworks when they accept an offer. Handling of these operations has to be added to the master's {{accept}} call. That is, the corresponding resource provider needs to be extracted from the offer's resources and a {{resource_provider::Event::OPERATION}} has to be sent to the resource provider. The resource provider will answer with a {{resource_provider::Call::Update}} which needs to be handled as well.
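The accept-path handling described in MESOS-7594 can be sketched as follows. This is a simplified illustration, not the master's actual code: the types below are hypothetical stand-ins for the protobuf `Offer::Operation` and resource provider messages, and the dispatcher only models the bookkeeping (route the operation to the provider that owns the target resource, keep it pending until the provider's update arrives).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical simplified types standing in for the protobuf messages.
enum class OperationType {
  CREATE_BLOCK, DESTROY_BLOCK, CREATE_VOLUME, DESTROY_VOLUME
};

struct Resource {
  std::string name;
  std::string resourceProviderId;  // empty if not provider-backed
};

struct Operation {
  OperationType type;
  Resource target;  // the offered resource the operation applies to
};

class OperationDispatcher {
public:
  // Extract the resource provider ID from the operation's target
  // resource and record the operation as pending until an update
  // arrives. Returns the provider ID the OPERATION event would be
  // sent to (empty if the resource has no associated provider).
  std::string dispatch(int operationId, const Operation& op) {
    if (!op.target.resourceProviderId.empty()) {
      pending[operationId] = op.target.resourceProviderId;
    }
    return op.target.resourceProviderId;
  }

  // Handle the resource provider's update for a dispatched operation:
  // once a terminal status is received, the operation is no longer
  // pending. Returns false for unknown operation IDs.
  bool handleUpdate(int operationId) {
    return pending.erase(operationId) > 0;
  }

  std::size_t pendingCount() const { return pending.size(); }

private:
  std::map<int, std::string> pending;  // operation ID -> provider ID
};
```

The real master additionally validates the operation, updates the allocator's view of the resources, and acknowledges the update back to the provider; the sketch isolates only the routing and pending-operation lifecycle the ticket describes.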