[jira] [Updated] (MESOS-4052) Simple hook implementation proxying out to another daemon process
[ https://issues.apache.org/jira/browse/MESOS-4052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhitao Li updated MESOS-4052:
-----------------------------
    Description: 
Right now, if a Mesos user needs a hook such as slavePreLaunchDockerHook, they must maintain the compilation, building, and packaging of a C++ dynamically linked library in house.

Designs like [Docker's Volume plugin|https://docs.docker.com/engine/extend/plugins_volume/] simply require the user to implement a predefined REST API in any language and listen on a domain socket. This would be more flexible for companies that do not use C++ as their primary language.

This ticket explores whether Mesos could provide a default module that 1) defines such an API and 2) proxies out to the external agent for any heavy lifting.

Please let me know whether you think this seems like a reasonable feature/requirement. I'm more than happy to work on this rather than maintain such a hook in house over the longer term.

  was:
Right now, if a Mesos user needs a hook such as slavePreLaunchDockerHook, they must maintain the compilation, building, and packaging of a C++ dynamically linked library in house.

Designs like [Docker's Volume plugin|https://docs.docker.com/engine/extend/plugins_volume/] simply require the user to implement a predefined REST API in any language and listen on a domain socket. This would be more flexible for companies that do not use C++ as their primary language.

This ticket explores whether Mesos could provide a default module that 1) defines such an API and 2) proxies out to the external agent for any heavy lifting.

I'm more than happy to work on this rather than maintain such a hook in house over the longer term.
> Simple hook implementation proxying out to another daemon process
> -----------------------------------------------------------------
>
>                 Key: MESOS-4052
>                 URL: https://issues.apache.org/jira/browse/MESOS-4052
>             Project: Mesos
>          Issue Type: Wish
>          Components: modules
>            Reporter: Zhitao Li
>            Priority: Minor
>
> Right now, if a Mesos user needs a hook such as slavePreLaunchDockerHook, they must maintain the compilation, building, and packaging of a C++ dynamically linked library in house.
> Designs like [Docker's Volume plugin|https://docs.docker.com/engine/extend/plugins_volume/] simply require the user to implement a predefined REST API in any language and listen on a domain socket. This would be more flexible for companies that do not use C++ as their primary language.
> This ticket explores whether Mesos could provide a default module that 1) defines such an API and 2) proxies out to the external agent for any heavy lifting.
> Please let me know whether you think this seems like a reasonable feature/requirement.
> I'm more than happy to work on this rather than maintain such a hook in house over the longer term.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (MESOS-4052) Simple hook implementation proxying out to another daemon process
Zhitao Li created MESOS-4052:
--------------------------------

             Summary: Simple hook implementation proxying out to another daemon process
                 Key: MESOS-4052
                 URL: https://issues.apache.org/jira/browse/MESOS-4052
             Project: Mesos
          Issue Type: Wish
          Components: modules
            Reporter: Zhitao Li
            Priority: Minor

Right now, if a Mesos user needs a hook such as slavePreLaunchDockerHook, they must maintain the compilation, building, and packaging of a C++ dynamically linked library in house.

Designs like [Docker's Volume plugin|https://docs.docker.com/engine/extend/plugins_volume/] simply require the user to implement a predefined REST API in any language and listen on a domain socket. This would be more flexible for companies that do not use C++ as their primary language.

This ticket explores whether Mesos could provide a default module that 1) defines such an API and 2) proxies out to the external agent for any heavy lifting.

I'm more than happy to work on this rather than maintain such a hook in house over the longer term.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MESOS-313) Report executor terminations to framework schedulers.
[ https://issues.apache.org/jira/browse/MESOS-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-313: Summary: Report executor terminations to framework schedulers. (was: report executor deaths to framework schedulers) > Report executor terminations to framework schedulers. > - > > Key: MESOS-313 > URL: https://issues.apache.org/jira/browse/MESOS-313 > Project: Mesos > Issue Type: Improvement >Reporter: Charles Reiss >Assignee: Zhitao Li > Labels: mesosphere, newbie > > The Scheduler interface has a callback for executorLost, but currently it is > never called. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3413) Docker containerizer does not symlink persistent volumes into sandbox
[ https://issues.apache.org/jira/browse/MESOS-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071374#comment-15071374 ]

Zhitao Li commented on MESOS-3413:
----------------------------------

It seems that this task is marked as "Won't fix". I wonder whether it's possible to come up with a short-term fix for the existing docker containerizer so users aren't blocked until the new Unified Containerizer is ready.

There seem to be two possible "quick" fixes:
- symlink the persistent volume into the sandbox;
- directly mount the persistent volume. The getPersistentVolumePath() in src/slave/paths.cpp is already available for this purpose.

Can someone comment on this possibility?

> Docker containerizer does not symlink persistent volumes into sandbox
> ---------------------------------------------------------------------
>
>                 Key: MESOS-3413
>                 URL: https://issues.apache.org/jira/browse/MESOS-3413
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, docker, slave
>    Affects Versions: 0.23.0
>            Reporter: Max Neunhöffer
>            Assignee: haosdent
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> For the ArangoDB framework I am trying to use the persistent primitives. Nearly all is working, but I am missing a crucial piece at the end: I have successfully created a persistent disk resource and have set the persistence and volume information in the DiskInfo message. However, I do not see any way to find out what directory on the host the mesos slave has reserved for us. I know it is ${MESOS_SLAVE_WORKDIR}/volumes/roles//_ but we have no way to query this information anywhere. The docker containerizer does not automatically mount this directory into our docker container, or symlink it into our sandbox. Therefore, I have essentially no access to it. Note that the mesos containerizer (which I cannot use for other reasons) seems to create a symlink in the sandbox to the actual path for the persistent volume. With that, I could mount the volume into our docker container and all would be well.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Comment Edited] (MESOS-3413) Docker containerizer does not symlink persistent volumes into sandbox
[ https://issues.apache.org/jira/browse/MESOS-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071374#comment-15071374 ]

Zhitao Li edited comment on MESOS-3413 at 12/29/15 6:25 PM:
------------------------------------------------------------

It seems that this task is marked as "Won't fix". I wonder whether it's possible to come up with a short-term fix for the existing docker containerizer so users aren't blocked until the new Unified Containerizer is ready.

There seem to be two possible ways to fix this:
- add a feedback response from the slave to the master/allocator so that resource offers containing this persistent volume carry a non-empty hostPath;
- at launch time, detect that a persistent volume is included in the ContainerInfo, and automatically mount the resolved host path. The getPersistentVolumePath() in src/slave/paths.cpp is already available for this purpose.

Can someone comment on this possibility?

was (Author: zhitao):
It seems that this task is marked as "Won't fix". I wonder whether it's possible to come up with a short-term fix for the existing docker containerizer so users aren't blocked until the new Unified Containerizer is ready.

There seem to be two possible "quick" fixes:
- symlink the persistent volume into the sandbox;
- directly mount the persistent volume. The getPersistentVolumePath() in src/slave/paths.cpp is already available for this purpose.

Can someone comment on this possibility?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-3413) Docker containerizer does not symlink persistent volumes into sandbox
[ https://issues.apache.org/jira/browse/MESOS-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074220#comment-15074220 ]

Zhitao Li commented on MESOS-3413:
----------------------------------

[~jieyu], thanks for the reply. My comment about the symlink was indeed incorrect, and I have updated my previous comment.

I think returning the host path in the resource offer could still work though, as long as the Mesos slave does not change the real location of the created volume.

I acknowledge that being unable to update volumes for a running container is a limitation, but I don't really understand why it's a show stopper. Making it clear to users that, for the DockerContainerizer, all persistent volumes must be created and mounted before the executor/container is created still sounds reasonable to me, and it allows us to use the current DockerContainerizer until the new Unified Containerizer is available and covers the other features we may need from the docker engine.

One of the reasons I really want this is that our in-house database system (which we are looking to run on Mesos) requires running multiple mysqld instances on the same machine, and that team has already spent quite some time dockerizing these instances for configuration and isolation purposes.

Thanks again for your time. I'm really looking forward to using the persistent volume primitives!

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-3413) Docker containerizer does not symlink persistent volumes into sandbox
[ https://issues.apache.org/jira/browse/MESOS-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074278#comment-15074278 ]

Zhitao Li commented on MESOS-3413:
----------------------------------

Thanks! [~haosd...@gmail.com], do you have any objection to reopening this task? I can also try to get this done if you don't have cycles for it.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MESOS-4264) DockerContainerizerTest.ROOT_DOCKER_Usage fails when the VM running does not have memory cgroup mounted
[ https://issues.apache.org/jira/browse/MESOS-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-4264: - Summary: DockerContainerizerTest.ROOT_DOCKER_Usage fails when the VM running does not have memory cgroup mounted (was: DockerContainerizerTest.ROOT_DOCKER_Usage fails when the VM running does not have ) > DockerContainerizerTest.ROOT_DOCKER_Usage fails when the VM running does not > have memory cgroup mounted > --- > > Key: MESOS-4264 > URL: https://issues.apache.org/jira/browse/MESOS-4264 > Project: Mesos > Issue Type: Bug > Components: docker, test > Environment: docker: 1.9.1 > EC2 > kernel: > {code:none} > $uname -a > Linux zhitao-jessie 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u4 > (2015-09-19) x86_64 GNU/Linux > $ mount | grep cgroup > tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755) > cgroup on /sys/fs/cgroup/systemd type cgroup > (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd) > cgroup on /sys/fs/cgroup/cpuset type cgroup > (rw,nosuid,nodev,noexec,relatime,cpuset) > cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup > (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct) > cgroup on /sys/fs/cgroup/devices type cgroup > (rw,nosuid,nodev,noexec,relatime,devices) > cgroup on /sys/fs/cgroup/freezer type cgroup > (rw,nosuid,nodev,noexec,relatime,freezer) > cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup > (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio) > cgroup on /sys/fs/cgroup/blkio type cgroup > (rw,nosuid,nodev,noexec,relatime,blkio) > cgroup on /sys/fs/cgroup/perf_event type cgroup > (rw,nosuid,nodev,noexec,relatime,perf_event) > {code} >Reporter: Zhitao Li >Priority: Minor > > With debug enabled, seeing following failure when running the tests as root: > {code:none} > [ RUN ] DockerContainerizerTest.ROOT_DOCKER_Usage > ABORT: > (../../3rdparty/libprocess/3rdparty/stout/include/stout/result.hpp:109): > Result::get() but state == 
NONE > *** Aborted at 1451549845 (unix time) try "date -d @1451549845" if you are > using GNU date *** > PC: @ 0x7f9528ac7107 (unknown) > *** SIGABRT (@0x18be0) received by PID 101344 (TID 0x7f951ef0e700) from PID > 101344; stack trace: *** > @ 0x7f9529a788d0 (unknown) > @ 0x7f9528ac7107 (unknown) > @ 0x7f9528ac84e8 (unknown) > @ 0x96dd99 _Abort() > @ 0x96ddc7 _Abort() > @ 0x9c8714 Result<>::get() > @ 0x7f952d871bef > mesos::internal::slave::DockerContainerizerProcess::cgroupsStatistics() > @ 0x7f952d870bc2 > _ZZN5mesos8internal5slave26DockerContainerizerProcess5usageERKNS_11ContainerIDEENKUliE_clEi > @ 0x7f952d871121 > _ZZN5mesos8internal5slave26DockerContainerizerProcess5usageERKNS_11ContainerIDEENKUlRKN6Docker9ContainerEE0_clES9_ > @ 0x7f952d877d8d > _ZZZNK7process9_DeferredIZN5mesos8internal5slave26DockerContainerizerProcess5usageERKNS1_11ContainerIDEEUlRKN6Docker9ContainerEE0_EcvSt8functionIFT_T0_EEINS_6FutureINS1_18ResourceStatisticsEEESB_EEvENKUlSB_E_clESB_ENKUlvE_clEv > @ 0x7f952d87b8dd > _ZNSt17_Function_handlerIFN7process6FutureIN5mesos18ResourceStatisticsEEEvEZZNKS0_9_DeferredIZNS2_8internal5slave26DockerContainerizerProcess5usageERKNS2_11ContainerIDEEUlRKN6Docker9ContainerEE0_EcvSt8functionIFT_T0_EEIS4_SG_EEvENKUlSG_E_clESG_EUlvE_E9_M_invokeERKSt9_Any_data > @ 0x7f952d8ac919 std::function<>::operator()() > @ 0x7f952d8a0b2a > _ZZN7process8dispatchIN5mesos18ResourceStatisticsEEENS_6FutureIT_EERKNS_4UPIDERKSt8functionIFS5_vEEENKUlPNS_11ProcessBaseEE_clESF_ > @ 0x7f952d8b5bc1 > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos18ResourceStatisticsEEENS0_6FutureIT_EERKNS0_4UPIDERKSt8functionIFS9_vEEEUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f952e16270f std::function<>::operator()() > @ 0x7f952e1479fe process::ProcessBase::visit() > @ 0x7f952e14d9ba process::DispatchEvent::visit() > @ 0x96ed2e process::ProcessBase::serve() > @ 0x7f952e143cda process::ProcessManager::resume() > @ 0x7f952e140ded > 
_ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f952e14d17a > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f952e14d128 > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f952e14d0b8 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEv
[jira] [Created] (MESOS-4264) DockerContainerizerTest.ROOT_DOCKER_Usage fails when the VM running does not have
Zhitao Li created MESOS-4264: Summary: DockerContainerizerTest.ROOT_DOCKER_Usage fails when the VM running does not have Key: MESOS-4264 URL: https://issues.apache.org/jira/browse/MESOS-4264 Project: Mesos Issue Type: Bug Components: docker, test Environment: docker: 1.9.1 EC2 kernel: {code:none} $uname -a Linux zhitao-jessie 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u4 (2015-09-19) x86_64 GNU/Linux $ mount | grep cgroup tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755) cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd) cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset) cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct) cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices) cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer) cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio) cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio) cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event) {code} Reporter: Zhitao Li Priority: Minor With debug enabled, seeing following failure when running the tests as root: {code:none} [ RUN ] DockerContainerizerTest.ROOT_DOCKER_Usage ABORT: (../../3rdparty/libprocess/3rdparty/stout/include/stout/result.hpp:109): Result::get() but state == NONE *** Aborted at 1451549845 (unix time) try "date -d @1451549845" if you are using GNU date *** PC: @ 0x7f9528ac7107 (unknown) *** SIGABRT (@0x18be0) received by PID 101344 (TID 0x7f951ef0e700) from PID 101344; stack trace: *** @ 0x7f9529a788d0 (unknown) @ 0x7f9528ac7107 (unknown) @ 0x7f9528ac84e8 (unknown) @ 0x96dd99 _Abort() @ 0x96ddc7 _Abort() @ 0x9c8714 Result<>::get() @ 0x7f952d871bef 
mesos::internal::slave::DockerContainerizerProcess::cgroupsStatistics() @ 0x7f952d870bc2 _ZZN5mesos8internal5slave26DockerContainerizerProcess5usageERKNS_11ContainerIDEENKUliE_clEi @ 0x7f952d871121 _ZZN5mesos8internal5slave26DockerContainerizerProcess5usageERKNS_11ContainerIDEENKUlRKN6Docker9ContainerEE0_clES9_ @ 0x7f952d877d8d _ZZZNK7process9_DeferredIZN5mesos8internal5slave26DockerContainerizerProcess5usageERKNS1_11ContainerIDEEUlRKN6Docker9ContainerEE0_EcvSt8functionIFT_T0_EEINS_6FutureINS1_18ResourceStatisticsEEESB_EEvENKUlSB_E_clESB_ENKUlvE_clEv @ 0x7f952d87b8dd _ZNSt17_Function_handlerIFN7process6FutureIN5mesos18ResourceStatisticsEEEvEZZNKS0_9_DeferredIZNS2_8internal5slave26DockerContainerizerProcess5usageERKNS2_11ContainerIDEEUlRKN6Docker9ContainerEE0_EcvSt8functionIFT_T0_EEIS4_SG_EEvENKUlSG_E_clESG_EUlvE_E9_M_invokeERKSt9_Any_data @ 0x7f952d8ac919 std::function<>::operator()() @ 0x7f952d8a0b2a _ZZN7process8dispatchIN5mesos18ResourceStatisticsEEENS_6FutureIT_EERKNS_4UPIDERKSt8functionIFS5_vEEENKUlPNS_11ProcessBaseEE_clESF_ @ 0x7f952d8b5bc1 _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos18ResourceStatisticsEEENS0_6FutureIT_EERKNS0_4UPIDERKSt8functionIFS9_vEEEUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ @ 0x7f952e16270f std::function<>::operator()() @ 0x7f952e1479fe process::ProcessBase::visit() @ 0x7f952e14d9ba process::DispatchEvent::visit() @ 0x96ed2e process::ProcessBase::serve() @ 0x7f952e143cda process::ProcessManager::resume() @ 0x7f952e140ded _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ @ 0x7f952e14d17a _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE @ 0x7f952e14d128 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ @ 0x7f952e14d0b8 
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE @ 0x7f952e14cffd _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv @ 0x7f952e14cf7a _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv @ 0x7f9529408970 (unknown) @ 0x7f9529a710a4 start_thread @ 0x7f9528b7804d (unknown) {code} I believe this is because we don't check {{memCgroup}} is {{SOME}} before using it in {{Docke
[jira] [Updated] (MESOS-4264) DockerContainerizerTest.ROOT_DOCKER_Usage fails when the VM used does not have memory cgroup mounted
[ https://issues.apache.org/jira/browse/MESOS-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhitao Li updated MESOS-4264:
-----------------------------
    Summary: DockerContainerizerTest.ROOT_DOCKER_Usage fails when the VM used does not have memory cgroup mounted  (was: DockerContainerizerTest.ROOT_DOCKER_Usage fails when the VM running does not have memory cgroup mounted)
[jira] [Commented] (MESOS-3413) Docker containerizer does not symlink persistent volumes into sandbox
[ https://issues.apache.org/jira/browse/MESOS-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15081819#comment-15081819 ]

Zhitao Li commented on MESOS-3413:
----------------------------------

[~jieyu] and [~haosd...@gmail.com], I've put up https://reviews.apache.org/r/41892 as a first pass at unblocking DockerContainerizer users who want to use persistent volumes. Please let me know what you think.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-3509) SlaveTest.TerminatingSlaveDoesNotReregister is flaky
[ https://issues.apache.org/jira/browse/MESOS-3509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15099128#comment-15099128 ]

Zhitao Li commented on MESOS-3509:
----------------------------------

I'll post some inconclusive findings, since my last change seems to have made this test go from "flaky" to "failing". In short, the clock/timer implementation under {{libevent}} seems to allow a timer to be created *out of order* w.r.t. {{Clock::advance()}} in certain cases. As a result, a timer created after the clock-advancing code was not affected by the advance and instead waited for a real wall time of 120s ({{slave::REGISTER_RETRY_INTERVAL_MAX * 2}}), which is longer than the default 15s of {{AWAIT_READY}}. I took a snippet of the log from running the test with {{--libevent}} and {{GLOG_v=3}}: {panel} I0114 23:25:16.880738 129300 authenticator.cpp:317] Authentication success I0114 23:25:16.880803 129295 process.cpp:2502] Resuming master@127.0.0.1:31370 at 2016-01-14 23:25:16.878512896+00:00 I0114 23:25:16.880844 129301 process.cpp:2502] Resuming crammd5_authenticator(1)@127.0.0.1:31370 at 2016-01-14 23:25:16.878512896+00:00 I0114 23:25:16.880861 129295 master.cpp:5475] Successfully authenticated principal 'test-principal' at slave(1)@127.0.0.1:31370 I0114 23:25:16.880903 129301 authenticator.cpp:431] Authentication session cleanup for crammd5_authenticatee(3)@127.0.0.1:31370 I0114 23:25:16.880873 129298 process.cpp:2502] Resuming crammd5_authenticatee(3)@127.0.0.1:31370 at 2016-01-14 23:25:16.878512896+00:00 I0114 23:25:16.880975 129301 process.cpp:2800] Donating thread to crammd5_authenticator_session(3)@127.0.0.1:31370 while waiting I0114 23:25:16.880991 129298 authenticatee.cpp:298] Authentication success I0114 23:25:16.881000 129301 process.cpp:2502] Resuming crammd5_authenticator_session(3)@127.0.0.1:31370 at 2016-01-14 23:25:16.878512896+00:00 I0114 23:25:16.881027 129301 process.cpp:2607] Cleaning up crammd5_authenticator_session(3)@127.0.0.1:31370 I0114 23:25:16.881069 129298
process.cpp:2502] Resuming slave(1)@127.0.0.1:31370 at 2016-01-14 23:25:16.878512896+00:00 I0114 23:25:16.881184 129300 process.cpp:2502] Resuming crammd5_authenticatee(3)@127.0.0.1:31370 at 2016-01-14 23:25:16.878512896+00:00 I0114 23:25:16.881225 129300 process.cpp:2607] Cleaning up crammd5_authenticatee(3)@127.0.0.1:31370 I0114 23:25:16.881301 129298 slave.cpp:860] Successfully authenticated with master master@127.0.0.1:31370 I0114 23:25:16.881631 129296 process.cpp:2502] Resuming master@127.0.0.1:31370 at 2016-01-14 23:25:16.878512896+00:00 I0114 23:25:16.881629 129298 slave.cpp:1254] Will retry registration in 2.172151ms if necessary I0114 23:25:16.881724 129298 clock.cpp:279] Created a timer for slave(1)@127.0.0.1:31370 in 2.172151ms in the future (2016-01-14 23:25:16.880685047+00:00) I0114 23:25:16.881906 129296 master.cpp:4314] Re-registering slave a9a5fba6-3191-424d-a1cf-5d12f35ada17-S0 at slave(1)@127.0.0.1:31370 (localhost) I0114 23:25:16.882228 129294 process.cpp:2502] Resuming slave(1)@127.0.0.1:31370 at 2016-01-14 23:25:16.878512896+00:00 I0114 23:25:16.882613 129295 process.cpp:2502] Resuming slave(1)@127.0.0.1:31370 at 2016-01-14 23:25:16.878512896+00:00 I0114 23:25:16.882622 129296 master.cpp:4502] Sending updated checkpointed resources to slave a9a5fba6-3191-424d-a1cf-5d12f35ada17-S0 at slave(1)@127.0.0.1:31370 (localhost) I0114 23:25:16.882712 129295 pid.cpp:93] Attempting to parse 'scheduler-72698b83-ea69-4f94-ac79-1fe005ba5ea9@127.0.0.1:31370' into a PID W0114 23:25:16.882750 129295 slave.cpp:2162] Dropping updateFramework message for a9a5fba6-3191-424d-a1cf-5d12f35ada17- because the slave is in DISCONNECTED state I0114 23:25:16.882935 129295 slave.cpp:2277] Updated checkpointed resources from to I0114 23:25:16.883074 129302 clock.cpp:152] Handling timers up to 2016-01-14 23:25:16.878512896+00:00 I0114 23:25:16.883116 129302 clock.cpp:197] Clock has settled I0114 23:25:16.888927 129289 clock.cpp:465] Clock is settled I0114 23:25:16.889067 
129289 clock.cpp:381] Clock advanced (2mins) to 0x20681f0 I0114 23:25:16.889143 129302 clock.cpp:152] Handling timers up to 2016-01-14 23:27:16.878512896+00:00 I0114 23:25:16.889176 129302 clock.cpp:159] Have timeout(s) at 2016-01-14 23:25:16.880685047+00:00 I0114 23:25:16.889196 129302 clock.cpp:159] Have timeout(s) at 2016-01-14 23:25:16.882896819+00:00 I0114 23:25:16.889209 129302 clock.cpp:159] Have timeout(s) at 2016-01-14 23:25:16.965011968+00:00 I0114 23:25:16.889220 129302 clock.cpp:159] Have timeout(s) at 2016-01-14 23:25:17.800440064+00:00 I0114 23:25:16.889231 129302 clock.cpp:159] Have timeout(s) at 2016-01-14 23:25:18.572319400+00:00 I0114 23:25:16.889242 129302 clock.cpp:159] Have timeout(s) at 2016-01-14 23:25:21.561927936+00:00 I0114 23:25:16.889253 129302 clock.cpp:159] Have timeout(s) at 2016-01-14 23:25:21.847002880+00:00 I0114 23:25:16.889263 129302 clock.cpp:159] Have timeout(s) at 2016-01-
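The ordering hazard described above can be illustrated with a toy test clock. This is a minimal Python model of the intended semantics, not libprocess code: a timer created after {{Clock::advance()}} must still be measured against the advanced virtual time; in the buggy interleaving, the registration-retry timer instead waits ~120s of real wall time and the 15s {{AWAIT_READY}} gives up first.

```python
# Toy virtual clock: timers carry deadlines in virtual time, and
# advance() fires every timer whose deadline has been reached.
class TestClock:
    def __init__(self):
        self.now = 0.0    # virtual time, in seconds
        self.timers = []  # list of (deadline, name)

    def create_timer(self, delay, name):
        # Correct behavior: the deadline is relative to the *virtual* now,
        # even if the timer is created after an advance().
        self.timers.append((self.now + delay, name))

    def advance(self, duration):
        self.now += duration
        fired = [n for (d, n) in self.timers if d <= self.now]
        self.timers = [(d, n) for (d, n) in self.timers if d > self.now]
        return fired

clock = TestClock()
clock.create_timer(5.0, "early")
assert clock.advance(120.0) == ["early"]

# The flaky case: a retry timer created *after* the first advance.
# With correct ordering it fires on the next 120s advance; in the buggy
# interleaving it would be pinned to wall time and never fire in the test.
clock.create_timer(120.0, "register-retry")
assert clock.advance(120.0) == ["register-retry"]
```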
[jira] [Created] (MESOS-7852) Tighten error handling in slaveRunTaskLabelDecorator hook
Zhitao Li created MESOS-7852: Summary: Tighten error handling in slaveRunTaskLabelDecorator hook Key: MESOS-7852 URL: https://issues.apache.org/jira/browse/MESOS-7852 Project: Mesos Issue Type: Bug Components: modules Reporter: Zhitao Li For whatever reason, {{slaveRunTaskLabelDecorator}} allows the module author to return an error, but the hook manager "silently" suppresses the error in {{HookManager::slaveRunTaskLabelDecorator}} and proceeds. This creates some problems: 1) a module author could incorrectly assume that a returned error will cause the task run to fail, but that is actually not the case; 2) a module author has no way to instruct the Mesos agent to stop the task launch if an unrecoverable error happens. I suggest we tighten the handling here to fail the task run if a module reports an error. A module can still work around soft errors by simply returning the input labels as-is. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
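The proposed tightening can be sketched in a few lines. This is a hypothetical Python model with invented names, not the C++ {{HookManager}} API: an error from any decorator aborts the launch instead of being swallowed, while a "soft" failure is expressed by returning the input labels unchanged.

```python
class HookError(Exception):
    """Raised when a decorator reports an unrecoverable error."""

def run_task_label_decorator(hooks, labels):
    # Proposed semantics: propagate a hook's error and fail the launch,
    # instead of silently keeping the previous labels and proceeding.
    for hook in hooks:
        result = hook(labels)
        if isinstance(result, Exception):
            raise HookError(str(result))
        labels = result
    return labels

ok_hook = lambda labels: labels + [("env", "prod")]
bad_hook = lambda labels: RuntimeError("secret service unavailable")
soft_hook = lambda labels: labels  # soft error: return input labels as-is

assert run_task_label_decorator([ok_hook], []) == [("env", "prod")]
assert run_task_label_decorator([soft_hook], [("a", "1")]) == [("a", "1")]

try:
    run_task_label_decorator([bad_hook], [])
    assert False, "expected the launch to fail"
except HookError:
    pass
```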
[jira] [Created] (MESOS-7868) Support virtual filesystem in `Files` interface
Zhitao Li created MESOS-7868: Summary: Support virtual filesystem in `Files` interface Key: MESOS-7868 URL: https://issues.apache.org/jira/browse/MESOS-7868 Project: Mesos Issue Type: Improvement Reporter: Zhitao Li Based on a conversation with [~bmahler], the {{Files}} interface, which is used in [/files/download | http://mesos.apache.org/documentation/latest/endpoints/files/download/] and other files endpoints, is intended to support virtual path lookup, so a caller can simply provide something like {{//latest}} to navigate to and/or download a file in the sandbox. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7868) Support virtual filesystem in `Files` interface
[ https://issues.apache.org/jira/browse/MESOS-7868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-7868: - Component/s: agent > Support virtual filesystem in `Files` interface > --- > > Key: MESOS-7868 > URL: https://issues.apache.org/jira/browse/MESOS-7868 > Project: Mesos > Issue Type: Improvement > Components: agent >Reporter: Zhitao Li > > Based on conversation with [~bmahler], the {{Files}} interface which is used > in [/files/download | > http://mesos.apache.org/documentation/latest/endpoints/files/download/] and > other files endpoints intended to support virtual path look up, so caller can > simply provide something like {{//latest}} to > navigate and/or download file in sandbox. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7874) Provide a consistent non-blocking preLaunch hook
Zhitao Li created MESOS-7874: Summary: Provide a consistent non-blocking preLaunch hook Key: MESOS-7874 URL: https://issues.apache.org/jira/browse/MESOS-7874 Project: Mesos Issue Type: Improvement Components: modules Reporter: Zhitao Li Assignee: Zhitao Li Our use case: we need a non-blocking prelaunch hook to integrate with our own secret management system, and this hook needs to work under both {{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom executor}} and {{command executor}}, with proper access to {{TaskInfo}} (actually certain labels on it). As of 1.3.0, the hooks in [hook.hpp | https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty inconsistent across these combinations. The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}; however, it has a couple of problems: 1. For DockerContainerizer + custom executor, it strips away the TaskInfo and sends a `None()` instead; 2. This hook is not called by {{MesosContainerizer}} at all. I guess that's because people can implement an {{isolator}} instead? However, that creates extra work for module authors and operators. The other option is {{masterLaunchTaskLabelDecorator}}, but it has its own problems: 1. Errors are silently swallowed, so a module cannot stop the task launch sequence; 2. It's a blocking API, which means we cannot wait on another subprocess or an RPC result. I'm inclined to fix the two problems on {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
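The non-blocking shape being asked for can be sketched with futures. This is an illustrative Python model with invented names ({{async_label_decorator}}, the vault reference, etc.), not a Mesos API: the hook returns a future so the agent can chain the launch on an external RPC instead of blocking.

```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)

def async_label_decorator(task_info):
    # The hook returns a future; the body stands in for a call out to an
    # external secret-management service.
    def fetch_secret_labels():
        return dict(task_info.get("labels", {}), secret_ref="vault:abc")
    return executor.submit(fetch_secret_labels)

def launch_task(task_info):
    # The agent would chain the launch continuation on the hook's future;
    # here we wait synchronously for brevity.
    labels = async_label_decorator(task_info).result(timeout=5)
    return {"task": task_info["name"], "labels": labels}

launched = launch_task({"name": "t1", "labels": {"team": "infra"}})
assert launched["labels"]["secret_ref"] == "vault:abc"
assert launched["labels"]["team"] == "infra"
```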
[jira] [Updated] (MESOS-7874) Provide a consistent non-blocking preLaunch hook
[ https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-7874: - Description: Our use case: we need a non-blocking prelaunch hook to integrate with our own secret management system, and this hook needs to work under both {{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom executor}} and {{command executor}}, with proper access to {{TaskInfo}} (actually certain labels on it). As of 1.3.0, the hooks in [hook.hpp | https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] pretty inconsistent on these combination cases. The closest option on is {{slavePreLaunchDockerTaskExecutorDecorator}}, however it has a couple of problems: 1. For DockerContainerizer + custom executor, it strips away TaskInfo and sends a `None()` instead; 2. This hook is not called on {{MesosContainerizer}} at all. I guess it's because people can implement an {{isolator}}? However, it creates extra work for module authors and operators. The other option is {{slaveLaunchTaskLabelDecorator}}, but it has own problems: 1. Error are silently swallowed so module cannot stop the task running sequence; 2. It's a blocking version, which means we cannot wait for another subprocess's or RPC result. I'm inclined to fix the two problems on {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions. was: Our use case: we need a non-blocking prelaunch hook to integrate with our own secret management system, and this hook needs to work under both {{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom executor}} and {{command executor}}, with proper access to {{TaskInfo}} (actually certain labels on it). As of 1.3.0, the hooks in [hook.hpp | https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] pretty inconsistent on these combination cases. The closest option on is {{slavePreLaunchDockerTaskExecutorDecorator}}, however it has a couple of problems: 1. 
For DockerContainerizer + custom executor, it strips away TaskInfo and sends a `None()` instead; 2. This hook is not called on {{MesosContainerizer}} at all. I guess it's because people can implement an {{isolator}}? However, it creates extra work for module authors and operators. The other option is {{masterLaunchTaskLabelDecorator}}, but it has own problems: 1. Error are silently swallowed so module cannot stop the task running sequence; 2. It's a blocking version, which means we cannot wait for another subprocess's or RPC result. I'm inclined to fix the two problems on {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions. > Provide a consistent non-blocking preLaunch hook > > > Key: MESOS-7874 > URL: https://issues.apache.org/jira/browse/MESOS-7874 > Project: Mesos > Issue Type: Improvement > Components: modules >Reporter: Zhitao Li >Assignee: Zhitao Li > Labels: hooks, module > > Our use case: we need a non-blocking prelaunch hook to integrate with our own > secret management system, and this hook needs to work under both > {{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom > executor}} and {{command executor}}, with proper access to {{TaskInfo}} > (actually certain labels on it). > As of 1.3.0, the hooks in [hook.hpp | > https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] pretty > inconsistent on these combination cases. > The closest option on is {{slavePreLaunchDockerTaskExecutorDecorator}}, > however it has a couple of problems: > 1. For DockerContainerizer + custom executor, it strips away TaskInfo and > sends a `None()` instead; > 2. This hook is not called on {{MesosContainerizer}} at all. I guess it's > because people can implement an {{isolator}}? However, it creates extra work > for module authors and operators. > The other option is {{slaveLaunchTaskLabelDecorator}}, but it has own > problems: > 1. Error are silently swallowed so module cannot stop the task running > sequence; > 2. 
It's a blocking version, which means we cannot wait for another > subprocess's or RPC result. > I'm inclined to fix the two problems on > {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7874) Provide a consistent non-blocking preLaunch hook
[ https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-7874: - Description: Our use case: we need a non-blocking prelaunch hook to integrate with our own secret management system, and this hook needs to work under both {{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom executor}} and {{command executor}}, with proper access to {{TaskInfo}} (actually certain labels on it). As of 1.3.0, the hooks in [hook.hpp | https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] pretty inconsistent on these combination cases. The closest option on is {{slavePreLaunchDockerTaskExecutorDecorator}}, however it has a couple of problems: 1. For DockerContainerizer + custom executor, it strips away TaskInfo and sends a `None()` instead; 2. This hook is not called on {{MesosContainerizer}} at all. I guess it's because people can implement an {{isolator}}? However, it creates extra work for module authors and operators. The other option is {{slaveRunTaskLabelDecorator}}, but it has own problems: 1. Error are silently swallowed so module cannot stop the task running sequence; 2. It's a blocking version, which means we cannot wait for another subprocess's or RPC result. I'm inclined to fix the two problems on {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions. was: Our use case: we need a non-blocking prelaunch hook to integrate with our own secret management system, and this hook needs to work under both {{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom executor}} and {{command executor}}, with proper access to {{TaskInfo}} (actually certain labels on it). As of 1.3.0, the hooks in [hook.hpp | https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] pretty inconsistent on these combination cases. The closest option on is {{slavePreLaunchDockerTaskExecutorDecorator}}, however it has a couple of problems: 1. 
For DockerContainerizer + custom executor, it strips away TaskInfo and sends a `None()` instead; 2. This hook is not called on {{MesosContainerizer}} at all. I guess it's because people can implement an {{isolator}}? However, it creates extra work for module authors and operators. The other option is {{slaveLaunchTaskLabelDecorator}}, but it has own problems: 1. Error are silently swallowed so module cannot stop the task running sequence; 2. It's a blocking version, which means we cannot wait for another subprocess's or RPC result. I'm inclined to fix the two problems on {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions. > Provide a consistent non-blocking preLaunch hook > > > Key: MESOS-7874 > URL: https://issues.apache.org/jira/browse/MESOS-7874 > Project: Mesos > Issue Type: Improvement > Components: modules >Reporter: Zhitao Li >Assignee: Zhitao Li > Labels: hooks, module > > Our use case: we need a non-blocking prelaunch hook to integrate with our own > secret management system, and this hook needs to work under both > {{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom > executor}} and {{command executor}}, with proper access to {{TaskInfo}} > (actually certain labels on it). > As of 1.3.0, the hooks in [hook.hpp | > https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] pretty > inconsistent on these combination cases. > The closest option on is {{slavePreLaunchDockerTaskExecutorDecorator}}, > however it has a couple of problems: > 1. For DockerContainerizer + custom executor, it strips away TaskInfo and > sends a `None()` instead; > 2. This hook is not called on {{MesosContainerizer}} at all. I guess it's > because people can implement an {{isolator}}? However, it creates extra work > for module authors and operators. > The other option is {{slaveRunTaskLabelDecorator}}, but it has own problems: > 1. Error are silently swallowed so module cannot stop the task running > sequence; > 2. 
It's a blocking version, which means we cannot wait for another > subprocess's or RPC result. > I'm inclined to fix the two problems on > {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7878) Add default value for http_framework_authenticators flag
Zhitao Li created MESOS-7878: Summary: Add default value for http_framework_authenticators flag Key: MESOS-7878 URL: https://issues.apache.org/jira/browse/MESOS-7878 Project: Mesos Issue Type: Improvement Reporter: Zhitao Li Priority: Minor Based on http://mesos.apache.org/documentation/latest/configuration/, {{http_authenticator}} has a default value of {{basic}}, but {{http_framework_authenticators}} does not have one. Given that people running the default Mesos distribution only have {{basic}} available, I feel we should add a default value to this flag to avoid surprising operators when they turn on HTTP frameworks. Proposing Greg to shepherd. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
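The proposal amounts to one line of flag registration. A minimal sketch, using Python's argparse as a stand-in for Mesos' flag parsing (only the flag names come from the ticket): with the default in place, a stock deployment that passes neither flag still gets {{basic}} for both.

```python
import argparse

parser = argparse.ArgumentParser()
# Existing behavior: http_authenticator already defaults to "basic".
parser.add_argument("--http_authenticator", default="basic")
# Proposed change: give http_framework_authenticators the same default.
parser.add_argument("--http_framework_authenticators", default="basic")

# A stock deployment passes neither flag and still gets "basic" for both.
flags = parser.parse_args([])
assert flags.http_authenticator == "basic"
assert flags.http_framework_authenticators == "basic"
```

An operator who does configure an alternative simply overrides the default, so the change is backward compatible.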
[jira] [Updated] (MESOS-7874) Provide a consistent non-blocking preLaunch hook
[ https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-7874: - Description: Our use case: we need a non-blocking way to notify our secret management system during the task launch sequence on the agent. This mechanism needs to work for both {{DockerContainerizer}} and {{MesosContainerizer}}, and for both {{custom executor}} and {{command executor}}, with proper access to labels on {{TaskInfo}}. As of 1.3.0, the hooks in [hook.hpp | https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty inconsistent across these combinations. The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}; however, it has a couple of problems: 1. For DockerContainerizer + custom executor, it strips away the TaskInfo and sends a `None()` instead; 2. This hook is not called by {{MesosContainerizer}} at all. I guess that's because people can implement an {{isolator}} instead? However, that creates extra work for module authors and operators. The other option is {{slaveRunTaskLabelDecorator}}, but it has its own problems: 1. Errors are silently swallowed, so a module cannot stop the task launch sequence; 2. It's a blocking API, which means we cannot wait on another subprocess or an RPC result. I'm inclined to fix the two problems on {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions. was: Our use case: we need a non-blocking prelaunch hook to integrate with our own secret management system, and this hook needs to work under both {{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom executor}} and {{command executor}}, with proper access to {{TaskInfo}} (actually certain labels on it). As of 1.3.0, the hooks in [hook.hpp | https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] pretty inconsistent on these combination cases. The closest option on is {{slavePreLaunchDockerTaskExecutorDecorator}}, however it has a couple of problems: 1. 
For DockerContainerizer + custom executor, it strips away TaskInfo and sends a `None()` instead; 2. This hook is not called on {{MesosContainerizer}} at all. I guess it's because people can implement an {{isolator}}? However, it creates extra work for module authors and operators. The other option is {{slaveRunTaskLabelDecorator}}, but it has own problems: 1. Error are silently swallowed so module cannot stop the task running sequence; 2. It's a blocking version, which means we cannot wait for another subprocess's or RPC result. I'm inclined to fix the two problems on {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions. > Provide a consistent non-blocking preLaunch hook > > > Key: MESOS-7874 > URL: https://issues.apache.org/jira/browse/MESOS-7874 > Project: Mesos > Issue Type: Improvement > Components: modules >Reporter: Zhitao Li >Assignee: Zhitao Li > Labels: hooks, module > > Our use case: we need a non-blocking way to notify our secret management > system during task launching sequence on agent. This mechanism needs to work > for both {{DockerContainerizer}} and {{MesosContainerizer}}, and both > {{custom executor}} and {{command executor}}, with proper access to labels on > {{TaskInfo}}. > As of 1.3.0, the hooks in [hook.hpp | > https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] pretty > inconsistent on these combination cases. > The closest option on is {{slavePreLaunchDockerTaskExecutorDecorator}}, > however it has a couple of problems: > 1. For DockerContainerizer + custom executor, it strips away TaskInfo and > sends a `None()` instead; > 2. This hook is not called on {{MesosContainerizer}} at all. I guess it's > because people can implement an {{isolator}}? However, it creates extra work > for module authors and operators. > The other option is {{slaveRunTaskLabelDecorator}}, but it has own problems: > 1. Error are silently swallowed so module cannot stop the task running > sequence; > 2. 
It's a blocking version, which means we cannot wait for another > subprocess's or RPC result. > I'm inclined to fix the two problems on > {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7893) Make sure slavePreLaunchDockerTaskExecutorDecorator is consistently called in DockerContainerizer
Zhitao Li created MESOS-7893: Summary: Make sure slavePreLaunchDockerTaskExecutorDecorator is consistently called in DockerContainerizer Key: MESOS-7893 URL: https://issues.apache.org/jira/browse/MESOS-7893 Project: Mesos Issue Type: Improvement Reporter: Zhitao Li When {{DockerContainerizer}} and a non-command executor are used together, the hook {{slavePreLaunchDockerTaskExecutorDecorator}} is called with {{TaskInfo = None()}}. We should keep the task info passed to the hook to provide a consistent interface. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
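The consistency fix above can be modeled in a few lines. This is a hypothetical Python sketch (names invented), not the C++ hook signature: the decorator should always receive the {{TaskInfo}} when a task triggered the launch, for command and custom executors alike, rather than {{None()}} in the custom-executor path.

```python
def pre_launch_decorator(executor_info, task_info):
    # Desired behavior: when task_info is present, its labels are visible
    # to the hook regardless of which executor type is launching.
    labels = dict(executor_info.get("labels", {}))
    if task_info is not None:
        labels.update(task_info.get("labels", {}))
    return labels

executor_info = {"name": "my-executor", "labels": {"owner": "infra"}}
task_info = {"name": "t1", "labels": {"tier": "batch"}}

# With the fix, the custom-executor path sees the task labels too:
assert pre_launch_decorator(executor_info, task_info) == {
    "owner": "infra", "tier": "batch"}

# The current buggy path passes None and the task labels are lost:
assert pre_launch_decorator(executor_info, None) == {"owner": "infra"}
```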
[jira] [Commented] (MESOS-7874) Provide a consistent non-blocking preLaunch hook
[ https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127628#comment-16127628 ] Zhitao Li commented on MESOS-7874: -- After some discussion we will do the following: 1. convert {{slaveRunTaskLabelDecorator}} and {{masterRunTaskLabelDecorator}} to be non-blocking; 2. ensure {{slavePreLaunchDockerTaskExecutorDecorator}} is called consistently in {{DockerContainerizer}}. I will repurpose this task for the first action item, and file MESOS-7893 for the second. > Provide a consistent non-blocking preLaunch hook > > > Key: MESOS-7874 > URL: https://issues.apache.org/jira/browse/MESOS-7874 > Project: Mesos > Issue Type: Improvement > Components: modules >Reporter: Zhitao Li >Assignee: Zhitao Li > Labels: hooks, module > > Our use case: we need a non-blocking way to notify our secret management > system during task launching sequence on agent. This mechanism needs to work > for both {{DockerContainerizer}} and {{MesosContainerizer}}, and both > {{custom executor}} and {{command executor}}, with proper access to labels on > {{TaskInfo}}. > As of 1.3.0, the hooks in [hook.hpp | > https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] pretty > inconsistent on these combination cases. > The closest option on is {{slavePreLaunchDockerTaskExecutorDecorator}}, > however it has a couple of problems: > 1. For DockerContainerizer + custom executor, it strips away TaskInfo and > sends a `None()` instead; > 2. This hook is not called on {{MesosContainerizer}} at all. I guess it's > because people can implement an {{isolator}}? However, it creates extra work > for module authors and operators. > The other option is {{slaveRunTaskLabelDecorator}}, but it has own problems: > 1. Error are silently swallowed so module cannot stop the task running > sequence; > 2. It's a blocking version, which means we cannot wait for another > subprocess's or RPC result. 
> I'm inclined to fix the two problems on > {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7874) Convert slaveRunTaskLabelDecorator and masterRunTaskLabelDecorator to non-blocking API
[ https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-7874: - Summary: Convert slaveRunTaskLabelDecorator and masterRunTaskLabelDecorator to non-blocking API (was: Provide a consistent non-blocking preLaunch hook) > Convert slaveRunTaskLabelDecorator and masterRunTaskLabelDecorator to > non-blocking API > -- > > Key: MESOS-7874 > URL: https://issues.apache.org/jira/browse/MESOS-7874 > Project: Mesos > Issue Type: Improvement > Components: modules >Reporter: Zhitao Li >Assignee: Zhitao Li > Labels: hooks, module > > Our use case: we need a non-blocking way to notify our secret management > system during task launching sequence on agent. This mechanism needs to work > for both {{DockerContainerizer}} and {{MesosContainerizer}}, and both > {{custom executor}} and {{command executor}}, with proper access to labels on > {{TaskInfo}}. > As of 1.3.0, the hooks in [hook.hpp | > https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] pretty > inconsistent on these combination cases. > The closest option on is {{slavePreLaunchDockerTaskExecutorDecorator}}, > however it has a couple of problems: > 1. For DockerContainerizer + custom executor, it strips away TaskInfo and > sends a `None()` instead; > 2. This hook is not called on {{MesosContainerizer}} at all. I guess it's > because people can implement an {{isolator}}? However, it creates extra work > for module authors and operators. > The other option is {{slaveRunTaskLabelDecorator}}, but it has own problems: > 1. Error are silently swallowed so module cannot stop the task running > sequence; > 2. It's a blocking version, which means we cannot wait for another > subprocess's or RPC result. > I'm inclined to fix the two problems on > {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (MESOS-7893) Make sure slavePreLaunchDockerTaskExecutorDecorator is consistently called in DockerContainerizer
[ https://issues.apache.org/jira/browse/MESOS-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li reassigned MESOS-7893: Shepherd: Till Toenshoff Assignee: Zhitao Li Labels: docker hooks module (was: ) Component/s: modules docker > Make sure slavePreLaunchDockerTaskExecutorDecorator is consistently called in > DockerContainerizer > - > > Key: MESOS-7893 > URL: https://issues.apache.org/jira/browse/MESOS-7893 > Project: Mesos > Issue Type: Improvement > Components: docker, modules >Reporter: Zhitao Li >Assignee: Zhitao Li > Labels: docker, hooks, module > > When {{DockerContainerizer}} and a non-command executor is used together, the > hook in {{slavePreLaunchDockerTaskExecutorDecorator}} is called with > {{TaskInfo = None()}}. > We should keep task info passed to the hook to provide consistent interface. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7874) Convert slaveRunTaskLabelDecorator and masterRunTaskLabelDecorator to non-blocking API
[ https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127771#comment-16127771 ] Zhitao Li commented on MESOS-7874: -- About implementation: The change to hook.hpp and hook/manager.hpp(cpp) should be relatively straightforward. For changes to the {{Master}} class, I took a quick look and there seem to be two different paths: * Performing the non-blocking hook before `Master::_accept` * Pro: * Can be done in parallel with authorization (the other non-blocking step in the operation); * Simpler handling for sending messages to the slave: because everything will be ready in {{Master::_accept}}, we can still send the corresponding messages to the slave (`CheckpointMessage` for RESERVE/UNRESERVE/..., `RunTask` or `RunTaskGroup` for running a task/taskgroup, etc). * Con: * Task validation and authorization are not performed yet, so hooks could see tasks which never get launched * technically that's always possible if the agent disconnects/goes down, or the `send(slave->pid, message);` gets dropped. Frameworks are reliably told the task status, but hooks are not. * More thoughts: * Maybe we should consider creating a private helper struct on the Master class to mutate `OfferOperation` (adding task labels is only one such mutation), to facilitate further changes? * Perform the hook inside `Master::_accept` * Pro: * We already know there is a pending task launch, so less code on this path; * Con: * To preserve the ordering of messages, we would need to change `void Master::_apply(...)` to return a `Future`, cache it, and only send out all messages once everything is ready. I'm inclined to go with the first path, but discussion with people more familiar with the large master code base is definitely welcome. Thanks! 
> Convert slaveRunTaskLabelDecorator and masterRunTaskLabelDecorator to > non-blocking API > -- > > Key: MESOS-7874 > URL: https://issues.apache.org/jira/browse/MESOS-7874 > Project: Mesos > Issue Type: Improvement > Components: modules >Reporter: Zhitao Li >Assignee: Zhitao Li > Labels: hooks, module > > Our use case: we need a non-blocking way to notify our secret management > system during task launching sequence on agent. This mechanism needs to work > for both {{DockerContainerizer}} and {{MesosContainerizer}}, and both > {{custom executor}} and {{command executor}}, with proper access to labels on > {{TaskInfo}}. > As of 1.3.0, the hooks in [hook.hpp | > https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] pretty > inconsistent on these combination cases. > The closest option on is {{slavePreLaunchDockerTaskExecutorDecorator}}, > however it has a couple of problems: > 1. For DockerContainerizer + custom executor, it strips away TaskInfo and > sends a `None()` instead; > 2. This hook is not called on {{MesosContainerizer}} at all. I guess it's > because people can implement an {{isolator}}? However, it creates extra work > for module authors and operators. > The other option is {{slaveRunTaskLabelDecorator}}, but it has own problems: > 1. Error are silently swallowed so module cannot stop the task running > sequence; > 2. It's a blocking version, which means we cannot wait for another > subprocess's or RPC result. > I'm inclined to fix the two problems on > {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
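The ordering concern in the second path above can be sketched with futures. This is a rough Python model (invented names, standing in for `Master::_apply` returning a `Future`), not Mesos code: each operation's message is produced asynchronously, cached, and only sent once every pending future has resolved, so relative order is preserved.

```python
from concurrent.futures import ThreadPoolExecutor, wait

executor = ThreadPoolExecutor(max_workers=4)
sent = []

def apply_operation(op):
    # Stand-in for a Master::_apply that returns a Future instead of
    # sending its message immediately.
    return executor.submit(lambda: f"message-for-{op}")

# Cache the futures in operation order...
pending = [apply_operation(op) for op in ["RESERVE", "LAUNCH"]]
# ...wait for all of them to resolve...
wait(pending)
# ...and only then send, in the original order, regardless of which
# future happened to complete first.
for future in pending:
    sent.append(future.result())

assert sent == ["message-for-RESERVE", "message-for-LAUNCH"]
```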
[jira] [Comment Edited] (MESOS-7874) Convert slaveRunTaskLabelDecorator and masterRunTaskLabelDecorator to non-blocking API
[ https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127771#comment-16127771 ] Zhitao Li edited comment on MESOS-7874 at 8/15/17 8:27 PM: --- About implementation: The change to hook.hpp, hook/manager.hpp(cpp) should be relative straightforward. For changes to {{Master}} class, I took a quick look and there seemed to be two different paths: * Performing unblocking hook before `Master::_accept` ** Pro: *** Can be done in parallel with authorization (the other nonblocking thing in operation); *** Simpler handling for sending messages to slave: because all things will be ready in {{Master::_accept}}, we can still send corresponding messages to slave (`CheckpointMessage` for RESERVE/UNRESERVE/..., `RunTask` or `RunTaskGroup` for running task/taskgroup, etc). ** Con: *** Task validation and authorization are not performed yet so hooks could seen tasks which never got launched technically it's always true if the agent disconnected/goes down, or the `send(slave->pid, message);` goes dropped. Framework are reliably told task status, but hooks are not delivered with it. ** More thoughts: *** Maybe we should consider creating a private helper struct on Master class to mutate `OfferOperation` (adding task label is only one of that), to facilitate further changes? * Perform the hook inside `Master::_accept` ** Pro: *** We already know there is a pending task launching, so less code on this part; ** Con: *** To preserve the ordering for messages, we would need to change `void Master::_apply(...)` to ask it return a `Future` and cache it, and only send out all messages once everything is ready. I'm inclined to go with first path, but some discussion with people more familiar with the large master code base is definitely welcomed. Thanks! was (Author: zhitao): About implementation: The change to hook.hpp, hook/manager.hpp(cpp) should be relative straightforward. 
For changes to {{Master}} class, I took a quick look and there seemed to be two different paths: * Performing unblocking hook before `Master::_accept` * Pro: * Can be done in parallel with authorization (the other nonblocking thing in operation); * Simpler handling for sending messages to slave: because all things will be ready in {{Master::_accept}}, we can still send corresponding messages to slave (`CheckpointMessage` for RESERVE/UNRESERVE/..., `RunTask` or `RunTaskGroup` for running task/taskgroup, etc). * Con: * Task validation and authorization are not performed yet so hooks could seen tasks which never got launched * technically it's always true if the agent disconnected/goes down, or the `send(slave->pid, message);` goes dropped. Framework are reliably told task status, but hooks are not delivered with it. * More thoughts: * Maybe we should consider creating a private helper struct on Master class to mutate `OfferOperation` (adding task label is only one of that), to facilitate further changes? * Perform the hook inside `Master::_accept` * Pro: * We already know there is a pending task launching, so less code on this part; * Con: * To preserve the ordering for messages, we would need to change `void Master::_apply(...)` to ask it return a `Future` and cache it, and only send out all messages once everything is ready. I'm inclined to go with first path, but some discussion with people more familiar with the large master code base is definitely welcomed. Thanks! > Convert slaveRunTaskLabelDecorator and masterRunTaskLabelDecorator to > non-blocking API > -- > > Key: MESOS-7874 > URL: https://issues.apache.org/jira/browse/MESOS-7874 > Project: Mesos > Issue Type: Improvement > Components: modules >Reporter: Zhitao Li >Assignee: Zhitao Li > Labels: hooks, module > > Our use case: we need a non-blocking way to notify our secret management > system during task launching sequence on agent. 
This mechanism needs to work > for both {{DockerContainerizer}} and {{MesosContainerizer}}, and both > {{custom executor}} and {{command executor}}, with proper access to labels on > {{TaskInfo}}. > As of 1.3.0, the hooks in [hook.hpp | > https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty > inconsistent on these combination cases. > The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}, > however it has a couple of problems: > 1. For DockerContainerizer + custom executor, it strips away TaskInfo and > sends a `None()` instead; > 2. This hook is not called on {{MesosContainerizer}} at all. I gue
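The first path above hinges on the hook returning a future that the master can compose with authorization. A minimal sketch of that shape, using only the standard library: `std::future` stands in for libprocess's `Future`, and the decorator body and injected label are hypothetical, not the actual Mesos hook API.

```cpp
#include <future>
#include <map>
#include <string>

// Hypothetical non-blocking label decorator: instead of mutating labels
// synchronously, it returns a future so the master can run it in parallel
// with authorization and only consume the result in `Master::_accept`.
using Labels = std::map<std::string, std::string>;

std::future<Labels> slaveRunTaskLabelDecorator(Labels labels) {
  // Simulate an asynchronous call-out (e.g. to an external service).
  return std::async(std::launch::async, [labels]() mutable {
    labels["injected-by-hook"] = "example-value";  // illustrative label only
    return labels;
  });
}
```

The master would collect such futures alongside the authorization future and proceed only once both are ready.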
[jira] [Created] (MESOS-7899) Expose sandboxes using virtual paths and hide the agent work directory
Zhitao Li created MESOS-7899: Summary: Expose sandboxes using virtual paths and hide the agent work directory Key: MESOS-7899 URL: https://issues.apache.org/jira/browse/MESOS-7899 Project: Mesos Issue Type: Task Reporter: Zhitao Li Assignee: Zhitao Li {{Files}} interface already supports a virtual file system. We should figure out a way to enable this in {{ /files/download}} endpoint to hide agent sandbox. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7868) Support virtual filesystem in `Files` interface
[ https://issues.apache.org/jira/browse/MESOS-7868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130854#comment-16130854 ] Zhitao Li commented on MESOS-7868: -- I filed MESOS-7899 for the task > Support virtual filesystem in `Files` interface > --- > > Key: MESOS-7868 > URL: https://issues.apache.org/jira/browse/MESOS-7868 > Project: Mesos > Issue Type: Improvement > Components: agent >Reporter: Zhitao Li > > Based on conversation with [~bmahler], the {{Files}} interface which is used > in [/files/download | > http://mesos.apache.org/documentation/latest/endpoints/files/download/] and > other files endpoints intended to support virtual path look up, so caller can > simply provide something like {{//latest}} to > navigate and/or download file in sandbox. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7899) Expose sandboxes using virtual paths and hide the agent work directory
[ https://issues.apache.org/jira/browse/MESOS-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130905#comment-16130905 ] Zhitao Li commented on MESOS-7899: -- [~bmahler], what's your suggestion on what the virtual path should be? One idea I have is to use a relative path {{frameworks//executors//latest}}. By omitting the root directory, the files API on the agent will default to browsing in {{//slaves/}}. The alternative is to provide a fake {{/latest-agent-work-dir}} and mount the executor directory there. However, endpoint/API users would still need to know this fake path to use it properly. I prefer the relative path idea but want to go over it with you before starting. Thanks. > Expose sandboxes using virtual paths and hide the agent work directory > -- > > Key: MESOS-7899 > URL: https://issues.apache.org/jira/browse/MESOS-7899 > Project: Mesos > Issue Type: Task >Reporter: Zhitao Li >Assignee: Zhitao Li > > {{Files}} interface already supports a virtual file system. We should figure > out a way to enable this in {{ /files/download}} endpoint to hide agent > sandbox. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
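The relative-path idea sketched in the comment above reduces to a string mapping from a virtual sandbox path onto the physical layout the agent already knows. All names below are hypothetical illustrations; a real endpoint would also need validation and access control.

```cpp
#include <string>

// Hypothetical resolver for the relative-path proposal: the caller passes a
// virtual sandbox path containing no work_dir or agent_id, and the agent
// maps it onto the physical directory layout it already knows about.
std::string resolveVirtualSandboxPath(
    const std::string& workDir,      // value of the agent's --work_dir flag
    const std::string& agentId,
    const std::string& virtualPath)  // e.g. "frameworks/F1/executors/E1/runs/latest"
{
  // Assumed physical layout: <work_dir>/slaves/<agent_id>/<virtual path>.
  return workDir + "/slaves/" + agentId + "/" + virtualPath;
}
```

With this mapping the framework only ever sees the virtual suffix, which is the point of hiding the work directory.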
[jira] [Updated] (MESOS-7899) Expose sandboxes using virtual paths and hide the agent work directory
[ https://issues.apache.org/jira/browse/MESOS-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-7899: - Shepherd: Benjamin Mahler > Expose sandboxes using virtual paths and hide the agent work directory > -- > > Key: MESOS-7899 > URL: https://issues.apache.org/jira/browse/MESOS-7899 > Project: Mesos > Issue Type: Task >Reporter: Zhitao Li >Assignee: Zhitao Li > > {{Files}} interface already supports a virtual file system. We should figure > out a way to enable this in {{ /files/download}} endpoint to hide agent > sandbox. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-5893) mesos-executor should adopt and reap orphan child processes
[ https://issues.apache.org/jira/browse/MESOS-5893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155588#comment-16155588 ] Zhitao Li commented on MESOS-5893: -- Is this problem still there, [~jieyu]? > mesos-executor should adopt and reap orphan child processes > --- > > Key: MESOS-5893 > URL: https://issues.apache.org/jira/browse/MESOS-5893 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.1.0 > Environment: mesos compiled from git master ( 1.1.0 ) > {{../configure --enable-ssl --enable-libevent --prefix=/usr --enable-optimize > --enable-silent-rules --enable-xfs-disk-isolator}} > isolators : > {{namespaces/pid,cgroups/cpu,cgroups/mem,filesystem/linux,docker/runtime,network/cni,docker/volume}} >Reporter: Stéphane Cottin > Labels: containerizer > > mesos containerizer does not properly handle children death. > discovered using marathon-lb, each topology update fork another haproxy, the > old haproxy process should properly die after its last client connection is > terminated, but turn into a zombie. 
> {noformat} > 7716 ?Ssl0:00 | \_ mesos-executor > --launcher_dir=/usr/libexec/mesos --sandbox_directory=/mnt/mesos/sandbox > --user=root --working_directory=/marathon-lb > --rootfs=/mnt/mesos/provisioner/containers/3b381d5c-7490-4dcd-ab4b-81051226075a/backends/overlay/rootfses/a4beacac-2d7e-445b-80c8-a9b4e480c491 > 7813 ?Ss 0:00 | | \_ sh -c /marathon-lb/run sse > --marathon https://marathon:8443 --auth-credentials user:pass --group > 'external' --ssl-certs /certs --max-serv-port-ip-per-task 20050 > 7823 ?S 0:00 | | | \_ /bin/bash /marathon-lb/run sse > --marathon https://marathon:8443 --auth-credentials user:pass --group > external --ssl-certs /certs --max-serv-port-ip-per-task 20050 > 7827 ?S 0:00 | | | \_ /usr/bin/runsv > /marathon-lb/service/haproxy > 7829 ?S 0:00 | | | | \_ /bin/bash ./run > 8879 ?S 0:00 | | | | \_ sleep 0.5 > 7828 ?Sl 0:00 | | | \_ python3 > /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config > /marathon-lb/haproxy.cfg --ssl-certs /certs --command sv reload > /marathon-lb/service/haproxy --sse --marathon https://marathon:8443 > --auth-credentials user:pass --group external --max-serv-port-ip-per-task > 20050 > 7906 ?Zs 0:00 | | \_ [haproxy] > 8628 ?Zs 0:00 | | \_ [haproxy] > 8722 ?Ss 0:00 | | \_ haproxy -p /tmp/haproxy.pid -f > /marathon-lb/haproxy.cfg -D -sf 144 52 > {noformat} > update: mesos-executor should be registered as a subreaper ( > http://man7.org/linux/man-pages/man2/prctl.2.html ) and propagate signals. > code sample: https://github.com/krallin/tini/blob/master/src/tini.c -- This message was sent by Atlassian JIRA (v6.4.14#64029)
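The subreaper suggestion in the update can be demonstrated in a few lines. This is a Linux-only illustration of the mechanism, not the proposed mesos-executor change; a real fix would also propagate signals, as tini does.

```cpp
#include <sys/prctl.h>
#include <sys/wait.h>
#include <unistd.h>

// Linux-only sketch: the process registers itself as a child subreaper, so
// orphaned grandchildren reparent to it (instead of init) and can be reaped
// with waitpid(), avoiding the zombie haproxy processes described above.
int adoptAndReapOrphans() {
  prctl(PR_SET_CHILD_SUBREAPER, 1);  // adopt future orphans (Linux >= 3.4)

  pid_t child = fork();
  if (child == 0) {
    if (fork() == 0) {    // grandchild: outlives its parent briefly
      usleep(100 * 1000);
      _exit(0);
    }
    _exit(0);             // child exits first, orphaning the grandchild
  }

  int reaped = 0;
  while (waitpid(-1, nullptr, 0) > 0) {
    reaped++;             // reaps both the child and the adopted grandchild
  }
  return reaped;
}
```

Without the prctl call, the grandchild would reparent to init and the second waitpid would never see it.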
[jira] [Commented] (MESOS-5582) Create a `cgroups/devices` isolator.
[ https://issues.apache.org/jira/browse/MESOS-5582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16157331#comment-16157331 ] Zhitao Li commented on MESOS-5582: -- Can this be closed already? > Create a `cgroups/devices` isolator. > > > Key: MESOS-5582 > URL: https://issues.apache.org/jira/browse/MESOS-5582 > Project: Mesos > Issue Type: Improvement >Reporter: Kevin Klues >Assignee: Kevin Klues > Labels: gpu, isolator, mesosphere > > Currently, all the logic for the `cgroups/devices` isolator is bundled into > the Nvidia GPU Isolator. We should abstract it out into it's own component > and remove the redundant logic from the Nvidia GPU Isolator. Assuming the > guaranteed ordering between isolators from MESOS-5581, we can be sure that > the dependency order between the `cgroups/devices` and `gpu/nvidia` isolators > is met. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7960) Deprecate non virtual path browse/read for sandbox
Zhitao Li created MESOS-7960: Summary: Deprecate non virtual path browse/read for sandbox Key: MESOS-7960 URL: https://issues.apache.org/jira/browse/MESOS-7960 Project: Mesos Issue Type: Improvement Reporter: Zhitao Li Priority: Minor We added support to browse and read files in the executor's latest sandbox run directory in MESOS-7899. We should remove support for physical paths after Mesos 2.0 because they require the {{work_dir}} and {{agent_id}}, which are not necessary to expose to frameworks. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7366) Agent sandbox gc could accidentally delete the entire persistent volume content
[ https://issues.apache.org/jira/browse/MESOS-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174009#comment-16174009 ] Zhitao Li commented on MESOS-7366: -- [~jieyu], sorry for reviving this task, but we might have missed a case for {{unmount}} in linux.cpp. [This unmount call |https://github.com/apache/mesos/blob/6f98b8d6d149c5497d16f588c683a68fccba4fc9/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L489] can still fail if the device is busy. > Agent sandbox gc could accidentally delete the entire persistent volume > content > --- > > Key: MESOS-7366 > URL: https://issues.apache.org/jira/browse/MESOS-7366 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.2, 1.1.1, 1.2.0 >Reporter: Zhitao Li >Assignee: Jie Yu >Priority: Blocker > Fix For: 1.0.4, 1.1.2, 1.2.1 > > > When 1) a persistent volume is mounted, 2) umount is stuck or something, 3) > executor directory gc being invoked, agent seems to emit a log like: > ``` > Failed to delete directory /runs//volume: Device or > resource busy > ``` > After this, the persistent volume directory is empty. > This could trigger data loss on critical workload so we should fix this ASAP. > The triggering environment is a custom executor w/o rootfs image. > Please let me know if you need more signal.
> {noformat} > I0407 15:18:22.752624 22758 paths.cpp:536] Trying to chown > '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377' > to user 'uber' > I0407 15:18:22.763229 22758 slave.cpp:6179] Launching executor > 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework > 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 with resources > cpus(cassandra-cstar-location-store, cassandra, {resource_id: > 29e2ac63-d605-4982-a463-fa311be94e0a}):0.1; > mem(cassandra-cstar-location-store, cassandra, {resource_id: > 2e1223f3-41a2-419f-85cc-cbc839c19c70}):768; > ports(cassandra-cstar-location-store, cassandra, {resource_id: > fdd6598f-f32b-4c90-a622-226684528139}):[31001-31001] in work directory > '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377' > I0407 15:18:22.764103 22758 slave.cpp:1987] Queued task > 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' for executor > 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework > 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 > I0407 15:18:22.766253 22764 containerizer.cpp:943] Starting container > d5a56564-3e24-4c60-9919-746710b78377 for executor > 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework > 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 > I0407 15:18:22.767514 22766 linux.cpp:730] Mounting > '/var/lib/mesos/volumes/roles/cassandra-cstar-location-store/d6290423-2ba4-4975-86f4-ffd84ad138ff' > to > '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume' > for persistent volume 
disk(cassandra-cstar-location-store, cassandra, > {resource_id: > fefc15d6-0c6f-4eac-a3f8-c34d0335c5ec})[d6290423-2ba4-4975-86f4-ffd84ad138ff:volume]:6466445 > of container d5a56564-3e24-4c60-9919-746710b78377 > I0407 15:18:22.894340 22768 containerizer.cpp:1494] Checkpointing container's > forked pid 6892 to > '/var/lib/mesos/meta/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/pids/forked.pid' > I0407 15:19:01.011916 22749 slave.cpp:3231] Got registration for executor > 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework > 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 from executor(1)@10.14.6.132:36837 > I0407 15:19:01.031939 22770 slave.cpp:2191] Sending queued task > 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' to executor > 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework > 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 at executor(1)@10.14.6.132:36837 > I0407 15:26:14.012861 22749 linux.cpp:627] Removing mount > '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/fra > meworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a5656 > 4-3e24-4c60-9919-746710b78377/volume' for persistent volume > disk(cassandra-cstar-loca
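One common pattern for the EBUSY failure described in the comment above is to retry the plain unmount and fall back to a lazy detach (on Linux, `umount2(target, MNT_DETACH)`). The helper below is an illustration, not Mesos's actual code; the unmount calls are injected as callables so the control flow can be exercised without real mounts.

```cpp
#include <cerrno>
#include <functional>

// Illustrative retry pattern for a busy mount point: attempt a clean unmount
// a few times, and if the device stays busy, fall back to a lazy detach
// (e.g. []{ return umount2(path, MNT_DETACH); }). Callables return 0 on
// success and -1 with errno set on failure, matching the syscall convention.
bool unmountWithRetry(
    const std::function<int()>& tryUnmount,
    const std::function<int()>& lazyUnmount,
    int attempts = 3)
{
  for (int i = 0; i < attempts; i++) {
    if (tryUnmount() == 0) {
      return true;               // clean unmount succeeded
    }
    if (errno != EBUSY) {
      return false;              // non-retryable error: give up
    }
  }
  return lazyUnmount() == 0;     // device stayed busy: detach lazily
}
```

A lazy detach removes the mount from the namespace immediately, which prevents the sandbox gc from recursing into live persistent-volume content.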
[jira] [Comment Edited] (MESOS-7366) Agent sandbox gc could accidentally delete the entire persistent volume content
[ https://issues.apache.org/jira/browse/MESOS-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174009#comment-16174009 ] Zhitao Li edited comment on MESOS-7366 at 9/20/17 11:45 PM: [~jieyu], sorry for reviving this task, but we might have missed a case for *unmount* in linux.cpp. [This unmount call |https://github.com/apache/mesos/blob/6f98b8d6d149c5497d16f588c683a68fccba4fc9/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L489] can still fail if device is busy. was (Author: zhitao): [~jieyu], sorry for reviving this task, but we might have missed a case for {{unmount} in linux.cpp. [This unmount call |https://github.com/apache/mesos/blob/6f98b8d6d149c5497d16f588c683a68fccba4fc9/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L489] can still fail if device is busy. > Agent sandbox gc could accidentally delete the entire persistent volume > content > --- > > Key: MESOS-7366 > URL: https://issues.apache.org/jira/browse/MESOS-7366 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.2, 1.1.1, 1.2.0 >Reporter: Zhitao Li >Assignee: Jie Yu >Priority: Blocker > Fix For: 1.0.4, 1.1.2, 1.2.1 > > > When 1) a persistent volume is mounted, 2) umount is stuck or something, 3) > executor directory gc being invoked, agent seems to emit a log like: > ``` > Failed to delete directory /runs//volume: Device or > resource busy > ``` > After this, the persistent volume directory is empty. > This could trigger data loss on critical workload so we should fix this ASAP. > The triggering environment is a custom executor w/o rootfs image. > Please let me know if you need more signal. 
> {noformat} > I0407 15:18:22.752624 22758 paths.cpp:536] Trying to chown > '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377' > to user 'uber' > I0407 15:18:22.763229 22758 slave.cpp:6179] Launching executor > 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework > 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 with resources > cpus(cassandra-cstar-location-store, cassandra, {resource_id: > 29e2ac63-d605-4982-a463-fa311be94e0a}):0.1; > mem(cassandra-cstar-location-store, cassandra, {resource_id: > 2e1223f3-41a2-419f-85cc-cbc839c19c70}):768; > ports(cassandra-cstar-location-store, cassandra, {resource_id: > fdd6598f-f32b-4c90-a622-226684528139}):[31001-31001] in work directory > '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377' > I0407 15:18:22.764103 22758 slave.cpp:1987] Queued task > 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' for executor > 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework > 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 > I0407 15:18:22.766253 22764 containerizer.cpp:943] Starting container > d5a56564-3e24-4c60-9919-746710b78377 for executor > 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework > 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 > I0407 15:18:22.767514 22766 linux.cpp:730] Mounting > '/var/lib/mesos/volumes/roles/cassandra-cstar-location-store/d6290423-2ba4-4975-86f4-ffd84ad138ff' > to > '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume' > for persistent volume 
disk(cassandra-cstar-location-store, cassandra, > {resource_id: > fefc15d6-0c6f-4eac-a3f8-c34d0335c5ec})[d6290423-2ba4-4975-86f4-ffd84ad138ff:volume]:6466445 > of container d5a56564-3e24-4c60-9919-746710b78377 > I0407 15:18:22.894340 22768 containerizer.cpp:1494] Checkpointing container's > forked pid 6892 to > '/var/lib/mesos/meta/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/pids/forked.pid' > I0407 15:19:01.011916 22749 slave.cpp:3231] Got registration for executor > 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework > 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 from executor(1)@10.14.6.132:36837 > I0407 15:19:01.031939 22770 slave.cpp:2191] Sending queued task > 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' to executor > 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework > 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 at execu
[jira] [Commented] (MESOS-1739) Allow slave reconfiguration on restart
[ https://issues.apache.org/jira/browse/MESOS-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16177221#comment-16177221 ] Zhitao Li commented on MESOS-1739: -- Ping on this too. I'm willing to work on this in the next couple of months and push this to happen. > Allow slave reconfiguration on restart > -- > > Key: MESOS-1739 > URL: https://issues.apache.org/jira/browse/MESOS-1739 > Project: Mesos > Issue Type: Epic >Reporter: Patrick Reilly > Labels: external-volumes, mesosphere, myriad > > Make it so that either via a slave restart or a out of process "reconfigure" > ping, the attributes and resources of a slave can be updated to be a superset > of what they used to be. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8018) Allow framework to opt-in to forward executor's JWT token to the tasks
Zhitao Li created MESOS-8018: Summary: Allow framework to opt-in to forward executor's JWT token to the tasks Key: MESOS-8018 URL: https://issues.apache.org/jira/browse/MESOS-8018 Project: Mesos Issue Type: Improvement Reporter: Zhitao Li The nested container API is an awesome feature and has enabled a lot of interesting use cases. A pattern we have seen multiple times is that a task (often the only one) launched by the default executor wants to further create containers nested behind itself (or the executor) to run some different workload. Because the entire request is 1) completely local to the executor container, and 2) okay to be bounded by the executor's lifecycle, we'd like to allow the task to use the Mesos agent API directly to create these nested containers. However, this creates a problem when we want to enable HTTP executor authentication, because the JWT auth tokens are only available to the executor, so the task's API requests will be rejected. Requiring the framework owner to fork or create a custom executor simply for this purpose also seems a bit too heavy. My proposal is to allow the framework to opt in via some field so that the launched task receives certain environment variables from the default executor, so the task can "act upon" the executor. One idea is to add a new field that allows certain environment variables to be forwarded from the executor to the task. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
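The opt-in forwarding idea reduces to filtering the executor's environment by a framework-supplied allow list. A sketch under that assumption; the function name and the environment-variable name used in the usage example are hypothetical, not actual Mesos fields.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical opt-in filter: the framework lists which executor environment
// variables (e.g. the JWT token variable) may cross into the task's
// environment; everything else stays private to the executor.
using Environment = std::map<std::string, std::string>;

Environment forwardedEnvironment(
    const Environment& executorEnv,
    const std::set<std::string>& forwardList)
{
  Environment taskEnv;
  for (const auto& [name, value] : executorEnv) {
    if (forwardList.count(name) > 0) {
      taskEnv[name] = value;    // only opt-in variables are forwarded
    }
  }
  return taskEnv;
}
```

With an empty allow list the behavior is unchanged, so existing frameworks are unaffected.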
[jira] [Commented] (MESOS-8018) Allow framework to opt-in to forward executor's JWT token to the tasks
[ https://issues.apache.org/jira/browse/MESOS-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184562#comment-16184562 ] Zhitao Li commented on MESOS-8018: -- [~jamespeach] If the framework *opt-in* to this behavior, then the task will be allowed to whatever the (default) executor can do through agent HTTP API, possibly including launching a privileged task within the executor container tree. > Allow framework to opt-in to forward executor's JWT token to the tasks > -- > > Key: MESOS-8018 > URL: https://issues.apache.org/jira/browse/MESOS-8018 > Project: Mesos > Issue Type: Improvement >Reporter: Zhitao Li > > Nested container API is an awesome feature and enabled a lot of interesting > use cases. A pattern we have seen multiple times is that a task (often the > only one) launched by default executor wants to further creates containers > nested behind itself (or the executor) to run some different workload. > Because the entire request is 1) completely local to the executor container, > 2) okay to be bounded within the executor's lifecycle, we'd like to allow the > task to use the mesos agent API directly to create these nested containers. > However, it creates a problem when we want to enable HTTP executor > authentication because the JWT auth tokens are only available to the executor > so the task's API request will be rejected. > Requiring framework owner to fork or create a custom executor simply for this > purpose also seems a bit too heavy. > My proposal is to allow framework to opt-in with some field so that the > launched task will receive certain environment variables from default > executor, so the task can "act upon" the executor. One idea is to add a new > field to allow certain environment variables to be forwarded from executor to > task. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (MESOS-8018) Allow framework to opt-in to forward executor's JWT token to the tasks
[ https://issues.apache.org/jira/browse/MESOS-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184562#comment-16184562 ] Zhitao Li edited comment on MESOS-8018 at 9/28/17 6:00 PM: --- [~jamespeach] If the framework *opt-in* to this behavior, then the task will be allowed to whatever the (default) executor can do through agent HTTP API, possibly including launching a privileged task within the executor container tree, if other part of AuthZ permits that. was (Author: zhitao): [~jamespeach] If the framework *opt-in* to this behavior, then the task will be allowed to whatever the (default) executor can do through agent HTTP API, possibly including launching a privileged task within the executor container tree. > Allow framework to opt-in to forward executor's JWT token to the tasks > -- > > Key: MESOS-8018 > URL: https://issues.apache.org/jira/browse/MESOS-8018 > Project: Mesos > Issue Type: Improvement >Reporter: Zhitao Li > > Nested container API is an awesome feature and enabled a lot of interesting > use cases. A pattern we have seen multiple times is that a task (often the > only one) launched by default executor wants to further creates containers > nested behind itself (or the executor) to run some different workload. > Because the entire request is 1) completely local to the executor container, > 2) okay to be bounded within the executor's lifecycle, we'd like to allow the > task to use the mesos agent API directly to create these nested containers. > However, it creates a problem when we want to enable HTTP executor > authentication because the JWT auth tokens are only available to the executor > so the task's API request will be rejected. > Requiring framework owner to fork or create a custom executor simply for this > purpose also seems a bit too heavy. 
> My proposal is to allow framework to opt-in with some field so that the > launched task will receive certain environment variables from default > executor, so the task can "act upon" the executor. One idea is to add a new > field to allow certain environment variables to be forwarded from executor to > task. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (MESOS-8018) Allow framework to opt-in to forward executor's JWT token to the tasks
[ https://issues.apache.org/jira/browse/MESOS-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184562#comment-16184562 ] Zhitao Li edited comment on MESOS-8018 at 9/28/17 6:01 PM: --- [~jamespeach] If the framework *opts in* to this behavior, then the task will be allowed to do whatever the (default) executor can do through the agent HTTP API, possibly including launching a privileged task within the executor container tree, if other parts of AuthZ permit that. My rationale is that this intentionally treats the task as an extension of the executor. I'd argue this is simpler than forcing everyone to write a custom executor. was (Author: zhitao): [~jamespeach] If the framework *opt-in* to this behavior, then the task will be allowed to whatever the (default) executor can do through agent HTTP API, possibly including launching a privileged task within the executor container tree, if other part of AuthZ permits that. > Allow framework to opt-in to forward executor's JWT token to the tasks > -- > > Key: MESOS-8018 > URL: https://issues.apache.org/jira/browse/MESOS-8018 > Project: Mesos > Issue Type: Improvement >Reporter: Zhitao Li > > Nested container API is an awesome feature and enabled a lot of interesting > use cases. A pattern we have seen multiple times is that a task (often the > only one) launched by default executor wants to further creates containers > nested behind itself (or the executor) to run some different workload. > Because the entire request is 1) completely local to the executor container, > 2) okay to be bounded within the executor's lifecycle, we'd like to allow the > task to use the mesos agent API directly to create these nested containers. > However, it creates a problem when we want to enable HTTP executor > authentication because the JWT auth tokens are only available to the executor > so the task's API request will be rejected.
> Requiring framework owner to fork or create a custom executor simply for this > purpose also seems a bit too heavy. > My proposal is to allow framework to opt-in with some field so that the > launched task will receive certain environment variables from default > executor, so the task can "act upon" the executor. One idea is to add a new > field to allow certain environment variables to be forwarded from executor to > task. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8040) Return nested containers in `GET_CONTAINERS` API call
Zhitao Li created MESOS-8040: Summary: Return nested containers in `GET_CONTAINERS` API call Key: MESOS-8040 URL: https://issues.apache.org/jira/browse/MESOS-8040 Project: Mesos Issue Type: Bug Reporter: Zhitao Li Right now, there is no way to directly query the agent for all nested containers' IDs, parent IDs, and other information. After talking to [~jieyu], the `GET_CONTAINERS` API seems a good fit to return this information. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8040) Return nested containers in `GET_CONTAINERS` API call
[ https://issues.apache.org/jira/browse/MESOS-8040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-8040: - Component/s: containerization Issue Type: Improvement (was: Bug) > Return nested containers in `GET_CONTAINERS` API call > - > > Key: MESOS-8040 > URL: https://issues.apache.org/jira/browse/MESOS-8040 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: Zhitao Li > > Right now, there is no way to directly query agent and know all nested > containers' id, parent id and other information. > After talking to [~jieyu], `GET_CONTAINERS` API seems a good fit to return > this information. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-6240) Allow executor/agent communication over non-TCP/IP stream socket.
[ https://issues.apache.org/jira/browse/MESOS-6240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191610#comment-16191610 ] Zhitao Li commented on MESOS-6240: -- +1 Moving the executor-to-agent API from TCP to a domain socket will also reduce some potential security exposure of the agent. Is there a design doc for this work? > Allow executor/agent communication over non-TCP/IP stream socket. > - > > Key: MESOS-6240 > URL: https://issues.apache.org/jira/browse/MESOS-6240 > Project: Mesos > Issue Type: Improvement > Components: containerization > Environment: Linux and Windows >Reporter: Avinash Sridharan >Assignee: Benjamin Hindman >Priority: Critical > Labels: mesosphere > > Currently, the executor agent communication happens specifically over TCP > sockets. This works fine in most cases, but specifically for the > `MesosContainerizer` when containers are running on CNI networks, this mode > of communication starts imposing constraints on the CNI network. Since, now > there has to connectivity between the CNI network (on which the executor is > running) and the agent. Introducing paths from a CNI network to the > underlying agent, at best, creates headaches for operators and at worst > introduces serious security holes in the network, since it is breaking the > isolation between the container CNI network and the host network (on which > the agent is running). > In order to simplify/strengthen deployment of Mesos containers on CNI > networks we therefore need to move away from using TCP/IP sockets for > executor/agent communication. Since, executor and agent are guaranteed to run > on the same host, the above problems can be resolved if, for the > `MesosContainerizer`, we use UNIX domain sockets or named pipes instead of > TCP/IP sockets for the executor/agent communication. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
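For illustration only: a UNIX stream socket stays entirely on-host, so nothing crosses the CNI network. Here `socketpair()` stands in for the named socket in the sandbox that the executor and agent would each open; the message string is just an example payload.

```cpp
#include <sys/socket.h>
#include <unistd.h>
#include <string>

// Round-trip a message over an AF_UNIX stream socket pair: one end plays
// the "executor", the other the "agent". No TCP/IP is involved, so the
// container's CNI network is never touched.
std::string roundTripOverUnixSocket(const std::string& message) {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
    return "";
  }
  ssize_t written = write(fds[0], message.data(), message.size());
  (void)written;  // "executor" writes its request

  char buffer[256] = {0};
  ssize_t n = read(fds[1], buffer, sizeof(buffer) - 1);  // "agent" reads it
  close(fds[0]);
  close(fds[1]);
  return std::string(buffer, n > 0 ? static_cast<size_t>(n) : 0);
}
```

A production design would use a named socket path inside the sandbox plus credentials checks, which is where the design-doc question above comes in.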
[jira] [Created] (MESOS-8070) Bundled GRPC build does not build on Debian 8
Zhitao Li created MESOS-8070: Summary: Bundled GRPC build does not build on Debian 8 Key: MESOS-8070 URL: https://issues.apache.org/jira/browse/MESOS-8070 Project: Mesos Issue Type: Bug Reporter: Zhitao Li Assignee: Chun-Hung Hsiao Debian 8 includes an outdated version of libc-ares-dev, which prevents the bundled gRPC library from building. I believe [~chhsia0] already has a fix. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8075) Add RWMutex to libprocess
Zhitao Li created MESOS-8075: Summary: Add RWMutex to libprocess Key: MESOS-8075 URL: https://issues.apache.org/jira/browse/MESOS-8075 Project: Mesos Issue Type: Task Components: libprocess Reporter: Zhitao Li We want to add a new {{RWMutex}} similar to {{Mutex}}, which can provide mutual exclusion for conflicting actions while allowing high concurrency for actions that can safely run at the same time. One use case is image garbage collection: the new API {{provisioner::pruneImages}} needs to be mutually exclusive with {{provisioner::provision}}, but multiple {{provisioner::provision}} calls can run concurrently and safely. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (MESOS-8075) Add RWMutex to libprocess
[ https://issues.apache.org/jira/browse/MESOS-8075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li reassigned MESOS-8075: Assignee: Zhitao Li > Add RWMutex to libprocess > - > > Key: MESOS-8075 > URL: https://issues.apache.org/jira/browse/MESOS-8075 > Project: Mesos > Issue Type: Task > Components: libprocess >Reporter: Zhitao Li >Assignee: Zhitao Li > > We want to add a new {{RWMutex}} similar to {{Mutex}}, which can provide > mutual exclusion for conflicting actions while allowing high > concurrency for actions that can safely run at the same time. > One use case is image garbage collection: the new API > {{provisioner::pruneImages}} needs to be mutually exclusive with > {{provisioner::provision}}, but multiple {{provisioner::provision}} calls can > run concurrently and safely. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8079) Checkpoint and recover layers used to provision rootfs in provisioner
Zhitao Li created MESOS-8079: Summary: Checkpoint and recover layers used to provision rootfs in provisioner Key: MESOS-8079 URL: https://issues.apache.org/jira/browse/MESOS-8079 Project: Mesos Issue Type: Task Components: provisioner Reporter: Zhitao Li This information is necessary for the {{provisioner}} to determine all layers used by active containers, which must be retained when image GC happens. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
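A minimal sketch of the checkpoint/recover cycle described above (the directory layout and the "layers" file name here are hypothetical, not the provisioner's actual checkpoint format):

```cpp
#include <fstream>
#include <string>
#include <vector>

// Persist the layer ids backing a container's rootfs, one per line, so
// they survive an agent restart. File name "layers" is made up for this
// sketch.
void checkpointLayers(const std::string& dir,
                      const std::vector<std::string>& layerIds) {
  std::ofstream out(dir + "/layers");
  for (const auto& id : layerIds) {
    out << id << '\n';
  }
}

// On recovery, read the checkpointed layer ids back; image GC can then
// treat these layers as in use and skip pruning them.
std::vector<std::string> recoverLayers(const std::string& dir) {
  std::vector<std::string> ids;
  std::ifstream in(dir + "/layers");
  for (std::string line; std::getline(in, line);) {
    ids.push_back(line);
  }
  return ids;
}
```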
[jira] [Updated] (MESOS-8075) Add RWMutex to libprocess
[ https://issues.apache.org/jira/browse/MESOS-8075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-8075: - Shepherd: Benjamin Hindman > Add RWMutex to libprocess > - > > Key: MESOS-8075 > URL: https://issues.apache.org/jira/browse/MESOS-8075 > Project: Mesos > Issue Type: Task > Components: libprocess >Reporter: Zhitao Li >Assignee: Zhitao Li > > We want to add a new {{RWMutex}} similar to {{Mutex}}, which can provide > mutual exclusion for conflicting actions while allowing high > concurrency for actions that can safely run at the same time. > One use case is image garbage collection: the new API > {{provisioner::pruneImages}} needs to be mutually exclusive with > {{provisioner::provision}}, but multiple {{provisioner::provision}} calls can > run concurrently and safely. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8090) Mesos 1.4.0 crashes with 1.3.x agent with oversubscription
[ https://issues.apache.org/jira/browse/MESOS-8090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-8090: - Affects Version/s: 1.4.0 > Mesos 1.4.0 crashes with 1.3.x agent with oversubscription > -- > > Key: MESOS-8090 > URL: https://issues.apache.org/jira/browse/MESOS-8090 > Project: Mesos > Issue Type: Bug > Components: master, oversubscription >Affects Versions: 1.4.0 >Reporter: Zhitao Li >Assignee: Michael Park > > We are seeing a crash in 1.4.0 master when it receives {{updateSlave}} from a > over-subscription enabled agent running 1.3.1 code. > The crash line is: > resources.cpp:1050] Check failed: !resource.has_role() cpus{REV}:19 > Stack trace in gdb: > {panel:title=My title} > #0 0x7f22f3553067 in __GI_raise (sig=sig@entry=6) at > ../nptl/sysdeps/unix/sysv/linux/raise.c:56 > #1 0x7f22f3554448 in __GI_abort () at abort.c:89 > #2 0x7f22f615cd79 in google::DumpStackTraceAndExit () at > src/utilities.cc:147 > #3 0x7f22f6154a4d in google::LogMessage::Fail () at src/logging.cc:1458 > #4 0x7f22f61566cd in google::LogMessage::SendToLog (this= out>) at src/logging.cc:1412 > #5 0x7f22f6154612 in google::LogMessage::Flush (this=0x18ac7) at > src/logging.cc:1281 > #6 0x7f22f61570b9 in google::LogMessageFatal::~LogMessageFatal > (this=, __in_chrg=) at src/logging.cc:1984 > #7 0x7f22f527e133 in mesos::Resources::isEmpty (resource=...) at > /mesos/src/common/resources.cpp:1051 > #8 0x7f22f527e1e5 in mesos::Resources::Resource_::isEmpty > (this=this@entry=0x7f22e713d2e0) at /mesos/src/common/resources.cpp:1173 > #9 0x7f22f527e20c in mesos::Resources::add (this=0x7f22e713d400, > that=...) at /mesos/src/common/resources.cpp:1993 > #10 0x7f22f527f860 in mesos::Resources::operator+= > (this=this@entry=0x7f22e713d400, that=...) at > /mesos/src/common/resources.cpp:2016 > #11 0x7f22f527f91d in mesos::Resources::operator+= > (this=this@entry=0x7f22e713d400, that=...) 
at > /mesos/src/common/resources.cpp:2025 > #12 0x7f22f527fa4b in mesos::Resources::Resources (this=0x7f22e713d400, > _resources=...) at /mesos/src/common/resources.cpp:1277 > #13 0x7f22f548b812 in mesos::internal::master::Master::updateSlave > (this=0x558137bbae70, message=...) at /mesos/src/master/master.cpp:6681 > #14 0x7f22f550adc1 in > ProtobufProcess::_handlerM > (t=0x558137bbae70, method= > (void > (mesos::internal::master::Master::*)(mesos::internal::master::Master * const, > const mesos::internal::UpdateSlaveMessage &)) 0x7f22f548b6d0 > const&)>, > data="\n)\n'07ba28cc-d9fa-44fb-8d6b-f8c5c90f8a90-S1\022\030\n\004cpus\020\000\032\t\t\000\000\000\000\000\000\063@2\001*J") > at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:799 > #15 0x7f22f54c8791 in > ProtobufProcess::visit (this=0x558137bbae70, > event=...) at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:104 > #16 0x7f22f54572d4 in mesos::internal::master::Master::_visit > (this=this@entry=0x558137bbae70, event=...) at > /mesos/src/master/master.cpp:1643 > #17 0x7f22f547014d in mesos::internal::master::Master::visit > (this=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1575 > #18 0x7f22f60b7169 in serve (event=..., this=0x558137bbbf28) at > /mesos/3rdparty/libprocess/include/process/process.hpp:87 > #19 process::ProcessManager::resume (this=, > process=0x558137bbbf28) at /mesos/3rdparty/libprocess/src/process.cpp:3346 > #20 0x7f22f60bd056 in operator() (__closure=0x558137aa3218) at > /mesos/3rdparty/libprocess/src/process.cpp:2881 > #21 _M_invoke<> (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1700 > #22 operator() (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1688 > #23 > std::thread::_Impl()> > >::_M_run(void) (this=0x558137aa3200) at /usr/include/c++/4.9/thread:115 > #24 0x7f22f40b3970 in ?? 
() from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 > #25 0x7f22f38d1064 in start_thread (arg=0x7f22e713e700) at > pthread_create.c:309 > #26 0x7f22f360662d in clone () at > ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 > {panel} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8090) Mesos 1.4.0 crashes with 1.3.x agent with oversubscription
Zhitao Li created MESOS-8090: Summary: Mesos 1.4.0 crashes with 1.3.x agent with oversubscription Key: MESOS-8090 URL: https://issues.apache.org/jira/browse/MESOS-8090 Project: Mesos Issue Type: Bug Components: master, oversubscription Reporter: Zhitao Li Assignee: Michael Park We are seeing a crash in the 1.4.0 master when it receives {{updateSlave}} from an oversubscription-enabled agent running 1.3.1 code. The crash line is: resources.cpp:1050] Check failed: !resource.has_role() cpus{REV}:19 Stack trace in gdb: {panel:title=My title} #0 0x7f22f3553067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56 #1 0x7f22f3554448 in __GI_abort () at abort.c:89 #2 0x7f22f615cd79 in google::DumpStackTraceAndExit () at src/utilities.cc:147 #3 0x7f22f6154a4d in google::LogMessage::Fail () at src/logging.cc:1458 #4 0x7f22f61566cd in google::LogMessage::SendToLog (this=) at src/logging.cc:1412 #5 0x7f22f6154612 in google::LogMessage::Flush (this=0x18ac7) at src/logging.cc:1281 #6 0x7f22f61570b9 in google::LogMessageFatal::~LogMessageFatal (this=, __in_chrg=) at src/logging.cc:1984 #7 0x7f22f527e133 in mesos::Resources::isEmpty (resource=...) at /mesos/src/common/resources.cpp:1051 #8 0x7f22f527e1e5 in mesos::Resources::Resource_::isEmpty (this=this@entry=0x7f22e713d2e0) at /mesos/src/common/resources.cpp:1173 #9 0x7f22f527e20c in mesos::Resources::add (this=0x7f22e713d400, that=...) at /mesos/src/common/resources.cpp:1993 #10 0x7f22f527f860 in mesos::Resources::operator+= (this=this@entry=0x7f22e713d400, that=...) at /mesos/src/common/resources.cpp:2016 #11 0x7f22f527f91d in mesos::Resources::operator+= (this=this@entry=0x7f22e713d400, that=...) at /mesos/src/common/resources.cpp:2025 #12 0x7f22f527fa4b in mesos::Resources::Resources (this=0x7f22e713d400, _resources=...) at /mesos/src/common/resources.cpp:1277 #13 0x7f22f548b812 in mesos::internal::master::Master::updateSlave (this=0x558137bbae70, message=...) 
at /mesos/src/master/master.cpp:6681 #14 0x7f22f550adc1 in ProtobufProcess::_handlerM (t=0x558137bbae70, method= (void (mesos::internal::master::Master::*)(mesos::internal::master::Master * const, const mesos::internal::UpdateSlaveMessage &)) 0x7f22f548b6d0 , data="\n)\n'07ba28cc-d9fa-44fb-8d6b-f8c5c90f8a90-S1\022\030\n\004cpus\020\000\032\t\t\000\000\000\000\000\000\063@2\001*J") at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:799 #15 0x7f22f54c8791 in ProtobufProcess::visit (this=0x558137bbae70, event=...) at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:104 #16 0x7f22f54572d4 in mesos::internal::master::Master::_visit (this=this@entry=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1643 #17 0x7f22f547014d in mesos::internal::master::Master::visit (this=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1575 #18 0x7f22f60b7169 in serve (event=..., this=0x558137bbbf28) at /mesos/3rdparty/libprocess/include/process/process.hpp:87 #19 process::ProcessManager::resume (this=, process=0x558137bbbf28) at /mesos/3rdparty/libprocess/src/process.cpp:3346 #20 0x7f22f60bd056 in operator() (__closure=0x558137aa3218) at /mesos/3rdparty/libprocess/src/process.cpp:2881 #21 _M_invoke<> (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1700 #22 operator() (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1688 #23 std::thread::_Impl()> >::_M_run(void) (this=0x558137aa3200) at /usr/include/c++/4.9/thread:115 #24 0x7f22f40b3970 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #25 0x7f22f38d1064 in start_thread (arg=0x7f22e713e700) at pthread_create.c:309 #26 0x7f22f360662d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 {panel} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8090) Mesos 1.4.0 crashes with 1.3.x agent with oversubscription
[ https://issues.apache.org/jira/browse/MESOS-8090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-8090: - Description: We are seeing a crash in 1.4.0 master when it receives {{updateSlave}} from a over-subscription enabled agent running 1.3.1 code. The crash line is: {panel:title=My title} resources.cpp:1050] Check failed: !resource.has_role() cpus{REV}:19 {panel} Stack trace in gdb: {panel:title=My title} #0 0x7f22f3553067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56 #1 0x7f22f3554448 in __GI_abort () at abort.c:89 #2 0x7f22f615cd79 in google::DumpStackTraceAndExit () at src/utilities.cc:147 #3 0x7f22f6154a4d in google::LogMessage::Fail () at src/logging.cc:1458 #4 0x7f22f61566cd in google::LogMessage::SendToLog (this=) at src/logging.cc:1412 #5 0x7f22f6154612 in google::LogMessage::Flush (this=0x18ac7) at src/logging.cc:1281 #6 0x7f22f61570b9 in google::LogMessageFatal::~LogMessageFatal (this=, __in_chrg=) at src/logging.cc:1984 #7 0x7f22f527e133 in mesos::Resources::isEmpty (resource=...) at /mesos/src/common/resources.cpp:1051 #8 0x7f22f527e1e5 in mesos::Resources::Resource_::isEmpty (this=this@entry=0x7f22e713d2e0) at /mesos/src/common/resources.cpp:1173 #9 0x7f22f527e20c in mesos::Resources::add (this=0x7f22e713d400, that=...) at /mesos/src/common/resources.cpp:1993 #10 0x7f22f527f860 in mesos::Resources::operator+= (this=this@entry=0x7f22e713d400, that=...) at /mesos/src/common/resources.cpp:2016 #11 0x7f22f527f91d in mesos::Resources::operator+= (this=this@entry=0x7f22e713d400, that=...) at /mesos/src/common/resources.cpp:2025 #12 0x7f22f527fa4b in mesos::Resources::Resources (this=0x7f22e713d400, _resources=...) at /mesos/src/common/resources.cpp:1277 #13 0x7f22f548b812 in mesos::internal::master::Master::updateSlave (this=0x558137bbae70, message=...) 
at /mesos/src/master/master.cpp:6681 #14 0x7f22f550adc1 in ProtobufProcess::_handlerM (t=0x558137bbae70, method= (void (mesos::internal::master::Master::*)(mesos::internal::master::Master * const, const mesos::internal::UpdateSlaveMessage &)) 0x7f22f548b6d0 , data="\n)\n'07ba28cc-d9fa-44fb-8d6b-f8c5c90f8a90-S1\022\030\n\004cpus\020\000\032\t\t\000\000\000\000\000\000\063@2\001*J") at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:799 #15 0x7f22f54c8791 in ProtobufProcess::visit (this=0x558137bbae70, event=...) at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:104 #16 0x7f22f54572d4 in mesos::internal::master::Master::_visit (this=this@entry=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1643 #17 0x7f22f547014d in mesos::internal::master::Master::visit (this=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1575 #18 0x7f22f60b7169 in serve (event=..., this=0x558137bbbf28) at /mesos/3rdparty/libprocess/include/process/process.hpp:87 #19 process::ProcessManager::resume (this=, process=0x558137bbbf28) at /mesos/3rdparty/libprocess/src/process.cpp:3346 #20 0x7f22f60bd056 in operator() (__closure=0x558137aa3218) at /mesos/3rdparty/libprocess/src/process.cpp:2881 #21 _M_invoke<> (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1700 #22 operator() (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1688 #23 std::thread::_Impl()> >::_M_run(void) (this=0x558137aa3200) at /usr/include/c++/4.9/thread:115 #24 0x7f22f40b3970 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #25 0x7f22f38d1064 in start_thread (arg=0x7f22e713e700) at pthread_create.c:309 #26 0x7f22f360662d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 {panel} was: We are seeing a crash in 1.4.0 master when it receives {{updateSlave}} from a over-subscription enabled agent running 1.3.1 code. 
The crash line is: resources.cpp:1050] Check failed: !resource.has_role() cpus{REV}:19 Stack trace in gdb: {panel:title=My title} #0 0x7f22f3553067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56 #1 0x7f22f3554448 in __GI_abort () at abort.c:89 #2 0x7f22f615cd79 in google::DumpStackTraceAndExit () at src/utilities.cc:147 #3 0x7f22f6154a4d in google::LogMessage::Fail () at src/logging.cc:1458 #4 0x7f22f61566cd in google::LogMessage::SendToLog (this=) at src/logging.cc:1412 #5 0x7f22f6154612 in google::LogMessage::Flush (this=0x18ac7) at src/logging.cc:1281 #6 0x7f22f61570b9 in google::LogMessageFatal::~LogMessageFatal (this=, __in_chrg=) at src/logging.cc:1984 #7 0x7f22f527e133 in mesos::Resources::isEmpty (resource=...) at /mesos/src/common/resources.cpp:1051 #8 0x7f22f527e1e5 in mesos::Resources::Resource_::isEmpty (this=this@entry=0x7f22e713d2e0) at /mesos/src/common/resources.cpp:1173 #9 0x7f22f527e20c in mesos::Resources::add (this=0x7f22e713d400, that=...) at /mesos/src/common/re
[jira] [Updated] (MESOS-8090) Mesos 1.4.0 crashes with 1.3.x agent with oversubscription
[ https://issues.apache.org/jira/browse/MESOS-8090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-8090: - Description: We are seeing a crash in 1.4.0 master when it receives {{updateSlave}} from a over-subscription enabled agent running 1.3.1 code. The crash line is: {code:none} resources.cpp:1050] Check failed: !resource.has_role() cpus{REV}:19 {code} Stack trace in gdb: {panel:title=My title} #0 0x7f22f3553067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56 #1 0x7f22f3554448 in __GI_abort () at abort.c:89 #2 0x7f22f615cd79 in google::DumpStackTraceAndExit () at src/utilities.cc:147 #3 0x7f22f6154a4d in google::LogMessage::Fail () at src/logging.cc:1458 #4 0x7f22f61566cd in google::LogMessage::SendToLog (this=) at src/logging.cc:1412 #5 0x7f22f6154612 in google::LogMessage::Flush (this=0x18ac7) at src/logging.cc:1281 #6 0x7f22f61570b9 in google::LogMessageFatal::~LogMessageFatal (this=, __in_chrg=) at src/logging.cc:1984 #7 0x7f22f527e133 in mesos::Resources::isEmpty (resource=...) at /mesos/src/common/resources.cpp:1051 #8 0x7f22f527e1e5 in mesos::Resources::Resource_::isEmpty (this=this@entry=0x7f22e713d2e0) at /mesos/src/common/resources.cpp:1173 #9 0x7f22f527e20c in mesos::Resources::add (this=0x7f22e713d400, that=...) at /mesos/src/common/resources.cpp:1993 #10 0x7f22f527f860 in mesos::Resources::operator+= (this=this@entry=0x7f22e713d400, that=...) at /mesos/src/common/resources.cpp:2016 #11 0x7f22f527f91d in mesos::Resources::operator+= (this=this@entry=0x7f22e713d400, that=...) at /mesos/src/common/resources.cpp:2025 #12 0x7f22f527fa4b in mesos::Resources::Resources (this=0x7f22e713d400, _resources=...) at /mesos/src/common/resources.cpp:1277 #13 0x7f22f548b812 in mesos::internal::master::Master::updateSlave (this=0x558137bbae70, message=...) 
at /mesos/src/master/master.cpp:6681 #14 0x7f22f550adc1 in ProtobufProcess::_handlerM (t=0x558137bbae70, method= (void (mesos::internal::master::Master::*)(mesos::internal::master::Master * const, const mesos::internal::UpdateSlaveMessage &)) 0x7f22f548b6d0 , data="\n)\n'07ba28cc-d9fa-44fb-8d6b-f8c5c90f8a90-S1\022\030\n\004cpus\020\000\032\t\t\000\000\000\000\000\000\063@2\001*J") at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:799 #15 0x7f22f54c8791 in ProtobufProcess::visit (this=0x558137bbae70, event=...) at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:104 #16 0x7f22f54572d4 in mesos::internal::master::Master::_visit (this=this@entry=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1643 #17 0x7f22f547014d in mesos::internal::master::Master::visit (this=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1575 #18 0x7f22f60b7169 in serve (event=..., this=0x558137bbbf28) at /mesos/3rdparty/libprocess/include/process/process.hpp:87 #19 process::ProcessManager::resume (this=, process=0x558137bbbf28) at /mesos/3rdparty/libprocess/src/process.cpp:3346 #20 0x7f22f60bd056 in operator() (__closure=0x558137aa3218) at /mesos/3rdparty/libprocess/src/process.cpp:2881 #21 _M_invoke<> (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1700 #22 operator() (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1688 #23 std::thread::_Impl()> >::_M_run(void) (this=0x558137aa3200) at /usr/include/c++/4.9/thread:115 #24 0x7f22f40b3970 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #25 0x7f22f38d1064 in start_thread (arg=0x7f22e713e700) at pthread_create.c:309 #26 0x7f22f360662d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 {panel} was: We are seeing a crash in 1.4.0 master when it receives {{updateSlave}} from a over-subscription enabled agent running 1.3.1 code. 
The crash line is: {panel:title=My title} resources.cpp:1050] Check failed: !resource.has_role() cpus{REV}:19 {panel} Stack trace in gdb: {panel:title=My title} #0 0x7f22f3553067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56 #1 0x7f22f3554448 in __GI_abort () at abort.c:89 #2 0x7f22f615cd79 in google::DumpStackTraceAndExit () at src/utilities.cc:147 #3 0x7f22f6154a4d in google::LogMessage::Fail () at src/logging.cc:1458 #4 0x7f22f61566cd in google::LogMessage::SendToLog (this=) at src/logging.cc:1412 #5 0x7f22f6154612 in google::LogMessage::Flush (this=0x18ac7) at src/logging.cc:1281 #6 0x7f22f61570b9 in google::LogMessageFatal::~LogMessageFatal (this=, __in_chrg=) at src/logging.cc:1984 #7 0x7f22f527e133 in mesos::Resources::isEmpty (resource=...) at /mesos/src/common/resources.cpp:1051 #8 0x7f22f527e1e5 in mesos::Resources::Resource_::isEmpty (this=this@entry=0x7f22e713d2e0) at /mesos/src/common/resources.cpp:1173 #9 0x7f22f527e20c in mesos::Resources::add (this=0x7f22e713d400, that=...) at /
[jira] [Commented] (MESOS-8090) Mesos 1.4.0 crashes with 1.3.x agent with oversubscription
[ https://issues.apache.org/jira/browse/MESOS-8090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208103#comment-16208103 ] Zhitao Li commented on MESOS-8090: -- A quick attempt to fix: https://reviews.apache.org/r/63084/ > Mesos 1.4.0 crashes with 1.3.x agent with oversubscription > -- > > Key: MESOS-8090 > URL: https://issues.apache.org/jira/browse/MESOS-8090 > Project: Mesos > Issue Type: Bug > Components: master, oversubscription >Affects Versions: 1.4.0 >Reporter: Zhitao Li >Assignee: Michael Park > > We are seeing a crash in 1.4.0 master when it receives {{updateSlave}} from a > over-subscription enabled agent running 1.3.1 code. > The crash line is: > {code:none} > resources.cpp:1050] Check failed: !resource.has_role() cpus{REV}:19 > {code} > Stack trace in gdb: > {panel:title=My title} > #0 0x7f22f3553067 in __GI_raise (sig=sig@entry=6) at > ../nptl/sysdeps/unix/sysv/linux/raise.c:56 > #1 0x7f22f3554448 in __GI_abort () at abort.c:89 > #2 0x7f22f615cd79 in google::DumpStackTraceAndExit () at > src/utilities.cc:147 > #3 0x7f22f6154a4d in google::LogMessage::Fail () at src/logging.cc:1458 > #4 0x7f22f61566cd in google::LogMessage::SendToLog (this= out>) at src/logging.cc:1412 > #5 0x7f22f6154612 in google::LogMessage::Flush (this=0x18ac7) at > src/logging.cc:1281 > #6 0x7f22f61570b9 in google::LogMessageFatal::~LogMessageFatal > (this=, __in_chrg=) at src/logging.cc:1984 > #7 0x7f22f527e133 in mesos::Resources::isEmpty (resource=...) at > /mesos/src/common/resources.cpp:1051 > #8 0x7f22f527e1e5 in mesos::Resources::Resource_::isEmpty > (this=this@entry=0x7f22e713d2e0) at /mesos/src/common/resources.cpp:1173 > #9 0x7f22f527e20c in mesos::Resources::add (this=0x7f22e713d400, > that=...) at /mesos/src/common/resources.cpp:1993 > #10 0x7f22f527f860 in mesos::Resources::operator+= > (this=this@entry=0x7f22e713d400, that=...) 
at > /mesos/src/common/resources.cpp:2016 > #11 0x7f22f527f91d in mesos::Resources::operator+= > (this=this@entry=0x7f22e713d400, that=...) at > /mesos/src/common/resources.cpp:2025 > #12 0x7f22f527fa4b in mesos::Resources::Resources (this=0x7f22e713d400, > _resources=...) at /mesos/src/common/resources.cpp:1277 > #13 0x7f22f548b812 in mesos::internal::master::Master::updateSlave > (this=0x558137bbae70, message=...) at /mesos/src/master/master.cpp:6681 > #14 0x7f22f550adc1 in > ProtobufProcess::_handlerM > (t=0x558137bbae70, method= > (void > (mesos::internal::master::Master::*)(mesos::internal::master::Master * const, > const mesos::internal::UpdateSlaveMessage &)) 0x7f22f548b6d0 > const&)>, > data="\n)\n'07ba28cc-d9fa-44fb-8d6b-f8c5c90f8a90-S1\022\030\n\004cpus\020\000\032\t\t\000\000\000\000\000\000\063@2\001*J") > at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:799 > #15 0x7f22f54c8791 in > ProtobufProcess::visit (this=0x558137bbae70, > event=...) at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:104 > #16 0x7f22f54572d4 in mesos::internal::master::Master::_visit > (this=this@entry=0x558137bbae70, event=...) at > /mesos/src/master/master.cpp:1643 > #17 0x7f22f547014d in mesos::internal::master::Master::visit > (this=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1575 > #18 0x7f22f60b7169 in serve (event=..., this=0x558137bbbf28) at > /mesos/3rdparty/libprocess/include/process/process.hpp:87 > #19 process::ProcessManager::resume (this=, > process=0x558137bbbf28) at /mesos/3rdparty/libprocess/src/process.cpp:3346 > #20 0x7f22f60bd056 in operator() (__closure=0x558137aa3218) at > /mesos/3rdparty/libprocess/src/process.cpp:2881 > #21 _M_invoke<> (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1700 > #22 operator() (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1688 > #23 > std::thread::_Impl()> > >::_M_run(void) (this=0x558137aa3200) at /usr/include/c++/4.9/thread:115 > #24 0x7f22f40b3970 in ?? 
() from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 > #25 0x7f22f38d1064 in start_thread (arg=0x7f22e713e700) at > pthread_create.c:309 > #26 0x7f22f360662d in clone () at > ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 > {panel} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
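The failed check is the post-refinement invariant that a {{Resource}} no longer carries its role in the legacy {{role}} field, which a 1.3.x agent still populates. One compatibility approach is to normalize incoming resources before validation. The sketch below uses simplified stand-in types (not the Mesos protobufs) and is an illustration of that idea, not the actual fix in the review above:

```cpp
#include <string>
#include <vector>

// Simplified stand-ins for the pre- and post-refinement resource formats.
struct Reservation {
  std::string role;
};

struct Resource {
  std::string name;
  std::string role;                       // pre-refinement (legacy) field
  std::vector<Reservation> reservations;  // post-refinement field
};

// Move a legacy Resource.role into the reservations stack so the
// post-refinement invariant (!resource.has_role()) holds. "*" denotes an
// unreserved resource and produces no reservation.
Resource upgrade(Resource r) {
  if (!r.role.empty() && r.role != "*") {
    r.reservations.push_back(Reservation{r.role});
  }
  r.role.clear();
  return r;
}
```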
[jira] [Created] (MESOS-8161) Potentially dangerous dangling mount when stopping task with persistent volume
Zhitao Li created MESOS-8161: Summary: Potentially dangerous dangling mount when stopping task with persistent volume Key: MESOS-8161 URL: https://issues.apache.org/jira/browse/MESOS-8161 Project: Mesos Issue Type: Bug Reporter: Zhitao Li Priority: Critical While MESOS-7366 fixed the case where an executor terminates, a very similar case can still happen if a task with a persistent volume terminates while its executor is still active and [this unmount call|https://github.com/apache/mesos/blob/6f98b8d6d149c5497d16f588c683a68fccba4fc9/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L489] fails due to "device busy". If agent gc or other operations then run in the host mount namespace, persistent volume data can be lost because of this. Agent log: {code:none} I1101 20:19:44.137109 102240 slave.cpp:3961] Sending acknowledgement for status update TASK_RUNNING (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for task node-1__23fa9624-4608-404f-8d6f-0235559588 8f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to executor(1)@10.70.142.140:36929 I1101 20:19:44.235196 102233 status_update_manager.cpp:395] Received status update acknowledgement (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for task node-1__23fa9624-4608-404f-8d6f-02355595888 f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 I1101 20:19:44.235302 102233 status_update_manager.cpp:832] Checkpointing ACK for status update TASK_RUNNING (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for task node-1__23fa9624-4608-404f-8d6f-0 2355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 I1101 20:19:59.135591 102213 slave.cpp:3634] Handling status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db6 1f6d4-fd0f-48be-927d-14282c12301f-0014 from executor(1)@10.70.142.140:36929 I1101 20:19:59.136494 102216 status_update_manager.cpp:323] Received status update TASK_RUNNING (UUID: 
c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f o f framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 I1101 20:19:59.136540 102216 status_update_manager.cpp:832] Checkpointing UPDATE for status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6 f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 I1101 20:19:59.136724 102234 slave.cpp:4051] Forwarding the update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61 f6d4-fd0f-48be-927d-14282c12301f-0014 to master@10.162.12.31:5050 I1101 20:19:59.136867 102234 slave.cpp:3961] Sending acknowledgement for status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-0235559588 8f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to executor(1)@10.70.142.140:36929 I1101 20:20:02.010108 102223 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:43046 with User-Agent='Python-urllib/2.7' I1101 20:20:02.038574 102238 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:43144 with User-Agent='Python-urllib/2.7' I1101 20:20:02.246388 102237 slave.cpp:5044] Current disk usage 0.23%. 
Max allowed age: 6.283560425078715days I1101 20:20:02.445312 102235 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:44716 with User-Agent='Python-urllib/2.7' I1101 20:20:02.448276 102215 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:44732 with User-Agent='Python-urllib/2.7' I1101 20:20:07.789482 102231 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:56414 with User-Agent='filebundle-agent' I1101 20:20:07.913359 102216 status_update_manager.cpp:395] Received status update acknowledgement (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888 f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 I1101 20:20:07.913455 102216 status_update_manager.cpp:832] Checkpointing ACK for status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-0 2355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 I1101 20:20:14.135632 102231 slave.cpp:3634] Handling status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f 6d4-fd0f-48be-927d-14282c12301f-0014 from executor(1)@10.70.142.140:36929 E1101 20:20:14.136687 102211 slave.cpp:6736] Unexpected terminal task state TASK_ERROR I1101 20:20:14.137081 102230 linux.cpp:627] Removing mount '/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/node-1_ex ecutor__cbf9
[jira] [Commented] (MESOS-7366) Agent sandbox gc could accidentally delete the entire persistent volume content
[ https://issues.apache.org/jira/browse/MESOS-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235004#comment-16235004 ] Zhitao Li commented on MESOS-7366: -- I filed MESOS-8161 for the other case. > Agent sandbox gc could accidentally delete the entire persistent volume > content > --- > > Key: MESOS-7366 > URL: https://issues.apache.org/jira/browse/MESOS-7366 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.2, 1.1.1, 1.2.0 >Reporter: Zhitao Li >Assignee: Jie Yu >Priority: Blocker > Fix For: 1.0.4, 1.1.2, 1.2.1 > > > When 1) a persistent volume is mounted, 2) umount is stuck or something, 3) > executor directory gc being invoked, agent seems to emit a log like: > ``` > Failed to delete directory /runs//volume: Device or > resource busy > ``` > After this, the persistent volume directory is empty. > This could trigger data loss on critical workload so we should fix this ASAP. > The triggering environment is a custom executor w/o rootfs image. > Please let me know if you need more signal. 
> {noformat} > I0407 15:18:22.752624 22758 paths.cpp:536] Trying to chown > '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377' > to user 'uber' > I0407 15:18:22.763229 22758 slave.cpp:6179] Launching executor > 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework > 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 with resources > cpus(cassandra-cstar-location-store, cassandra, {resource_id: > 29e2ac63-d605-4982-a463-fa311be94e0a}):0.1; > mem(cassandra-cstar-location-store, cassandra, {resource_id: > 2e1223f3-41a2-419f-85cc-cbc839c19c70}):768; > ports(cassandra-cstar-location-store, cassandra, {resource_id: > fdd6598f-f32b-4c90-a622-226684528139}):[31001-31001] in work directory > '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377' > I0407 15:18:22.764103 22758 slave.cpp:1987] Queued task > 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' for executor > 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework > 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 > I0407 15:18:22.766253 22764 containerizer.cpp:943] Starting container > d5a56564-3e24-4c60-9919-746710b78377 for executor > 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework > 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 > I0407 15:18:22.767514 22766 linux.cpp:730] Mounting > '/var/lib/mesos/volumes/roles/cassandra-cstar-location-store/d6290423-2ba4-4975-86f4-ffd84ad138ff' > to > '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume' > for persistent volume 
disk(cassandra-cstar-location-store, cassandra, > {resource_id: > fefc15d6-0c6f-4eac-a3f8-c34d0335c5ec})[d6290423-2ba4-4975-86f4-ffd84ad138ff:volume]:6466445 > of container d5a56564-3e24-4c60-9919-746710b78377 > I0407 15:18:22.894340 22768 containerizer.cpp:1494] Checkpointing container's > forked pid 6892 to > '/var/lib/mesos/meta/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/pids/forked.pid' > I0407 15:19:01.011916 22749 slave.cpp:3231] Got registration for executor > 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework > 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 from executor(1)@10.14.6.132:36837 > I0407 15:19:01.031939 22770 slave.cpp:2191] Sending queued task > 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' to executor > 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework > 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 at executor(1)@10.14.6.132:36837 > I0407 15:26:14.012861 22749 linux.cpp:627] Removing mount > '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/fra > meworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a5656 > 4-3e24-4c60-9919-746710b78377/volume' for persistent volume > disk(cassandra-cstar-location-store, cassandra, {resource_id: > fefc15d6-0c6f-4eac-a3f8-c34d0335c5ec})[d6290423-2ba4-4975-86f4-ffd84ad138ff:volume]:6466445 > of container d5a56564-3e24-4c60-9919-746710b78377 > E0407 15:26:14.013828 22756 slave.cpp:3903] Failed to update resources for >
[jira] [Assigned] (MESOS-8280) Mesos Containerizer GC should set 'layers' after checkpointing layer ids in provisioner.
[ https://issues.apache.org/jira/browse/MESOS-8280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li reassigned MESOS-8280: Assignee: Zhitao Li > Mesos Containerizer GC should set 'layers' after checkpointing layer ids in > provisioner. > > > Key: MESOS-8280 > URL: https://issues.apache.org/jira/browse/MESOS-8280 > Project: Mesos > Issue Type: Bug > Components: image-gc, provisioner >Reporter: Gilbert Song >Assignee: Zhitao Li >Priority: Critical > Labels: containerizer, image-gc, provisioner > > {noformat} > 1 > 22 > 33 > 44 > 1 > 22 > 33 > 44 > I1129 23:24:45.469543 6592 registry_puller.cpp:395] Extracting layer tar > ball > '/tmp/mesos/store/docker/staging/MVgVC7/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4 > to rootfs > '/tmp/mesos/store/docker/staging/MVgVC7/38135e3743e6dcb66bd1394b633053714333c7b7cf930bfeebfda660c06e/rootfs.overlay' > I1129 23:24:45.473287 6592 registry_puller.cpp:395] Extracting layer tar > ball > '/tmp/mesos/store/docker/staging/MVgVC7/sha256:b56ae66c29370df48e7377c8f9baa744a3958058a766793f821dadcb144a4647 > to rootfs > '/tmp/mesos/store/docker/staging/MVgVC7/b5815a31a59b66c909dbf6c670de78690d4b52649b8e283fc2bfd2594f61cca3/rootfs.overlay' > I1129 23:24:45.582002 6594 registry_puller.cpp:395] Extracting layer tar > ball > '/tmp/mesos/store/docker/staging/6Zbc17/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4 > to rootfs > '/tmp/mesos/store/docker/staging/6Zbc17/e28617c6dd2169bfe2b10017dfaa04bd7183ff840c4f78ebe73fca2a89effeb6/rootfs.overlay' > I1129 23:24:45.589404 6595 metadata_manager.cpp:167] Successfully cached > image 'alpine' > I1129 23:24:45.590204 6594 registry_puller.cpp:395] Extracting layer tar > ball > '/tmp/mesos/store/docker/staging/6Zbc17/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4 > to rootfs > '/tmp/mesos/store/docker/staging/6Zbc17/be4ce2753831b8952a5b797cf45b2230e1befead6f5db0630bcb24a5f554255e/rootfs.overlay' > I1129 
23:24:45.595190 6594 registry_puller.cpp:395] Extracting layer tar > ball > '/tmp/mesos/store/docker/staging/6Zbc17/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4 > to rootfs > '/tmp/mesos/store/docker/staging/6Zbc17/53b5066c5a7dff5d6f6ef0c1945572d6578c083d550d2a3d575b4cdf7460306f/rootfs.overlay' > I1129 23:24:45.599500 6594 registry_puller.cpp:395] Extracting layer tar > ball > '/tmp/mesos/store/docker/staging/6Zbc17/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4 > to rootfs > '/tmp/mesos/store/docker/staging/6Zbc17/a9eb172552348a9a49180694790b33a1097f546456d041b6e82e4d7716ddb721/rootfs.overlay' > I1129 23:24:45.602047 6597 provisioner.cpp:506] Provisioning image rootfs > '/tmp/provisioner/containers/3bbc3fd1-0138-43a9-94ba-d017d813daac/containers/01de09c5-d8e9-412e-8825-a592d2c875e5/backends/overlay/rootfses/b5d48445-848d-4274-a4f8-e909351ebc35' > for container > 3bbc3fd1-0138-43a9-94ba-d017d813daac.01de09c5-d8e9-412e-8825-a592d2c875e5 > using overlay backend > I1129 23:24:45.602751 6594 registry_puller.cpp:395] Extracting layer tar > ball > '/tmp/mesos/store/docker/staging/6Zbc17/sha256:1db09adb5ddd7f1a07b6d585a7db747a51c7bd17418d47e91f901bdf420abd66 > to rootfs > '/tmp/mesos/store/docker/staging/6Zbc17/120e218dd395ec314e7b6249f39d2853911b3d6def6ea164ae05722649f34b16/rootfs.overlay' > I1129 23:24:45.603054 6596 overlay.cpp:168] Created symlink > '/tmp/provisioner/containers/3bbc3fd1-0138-43a9-94ba-d017d813daac/containers/01de09c5-d8e9-412e-8825-a592d2c875e5/backends/overlay/scratch/b5d48445-848d-4274-a4f8-e909351ebc35/links' > -> '/tmp/xAWQ8y' > I1129 23:24:45.604398 6596 overlay.cpp:196] Provisioning image rootfs with > overlayfs: > 
'lowerdir=/tmp/xAWQ8y/1:/tmp/xAWQ8y/0,upperdir=/tmp/provisioner/containers/3bbc3fd1-0138-43a9-94ba-d017d813daac/containers/01de09c5-d8e9-412e-8825-a592d2c875e5/backends/overlay/scratch/b5d48445-848d-4274-a4f8-e909351ebc35/upperdir,workdir=/tmp/provisioner/containers/3bbc3fd1-0138-43a9-94ba-d017d813daac/containers/01de09c5-d8e9-412e-8825-a592d2c875e5/backends/overlay/scratch/b5d48445-848d-4274-a4f8-e909351ebc35/workdir' > I1129 23:24:45.607802 6594 registry_puller.cpp:395] Extracting layer tar > ball > '/tmp/mesos/store/docker/staging/6Zbc17/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4 > to rootfs > '/tmp/mesos/store/docker/staging/6Zbc17/42eed7f1bf2ac3f1610c5e616d2ab1ee9c7290234240388d6297bc0f32c34229/rootfs.overlay' > I1129 23:24:45.612139 6594 registry_puller.cpp:395] Extracting layer tar > ball > '/tmp/mesos/store/docker/staging/6Zbc17/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955
[jira] [Commented] (MESOS-8070) Bundled GRPC build does not build on Debian 8
[ https://issues.apache.org/jira/browse/MESOS-8070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16285336#comment-16285336 ] Zhitao Li commented on MESOS-8070: -- [~gilbert], can we make sure this makes the 1.5 release? Thanks! > Bundled GRPC build does not build on Debian 8 > - > > Key: MESOS-8070 > URL: https://issues.apache.org/jira/browse/MESOS-8070 > Project: Mesos > Issue Type: Bug >Reporter: Zhitao Li >Assignee: Chun-Hung Hsiao > Fix For: 1.5.0 > > > Debian 8 includes an outdated version of libc-ares-dev, which prevents the > bundled gRPC from building. > I believe [~chhsia0] already has a fix. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8323) Separate resource fetching timeout from executor_registration_timeout
Zhitao Li created MESOS-8323: Summary: Separate resource fetching timeout from executor_registration_timeout Key: MESOS-8323 URL: https://issues.apache.org/jira/browse/MESOS-8323 Project: Mesos Issue Type: Improvement Reporter: Zhitao Li Container images and fetched resources can vary widely in size, so it is desirable to have a fetch timeout (a duration) that is separate from the executor registration timeout. [~bmahler], can we also agree this should be customizable per task launch request (which can hopefully provide a better value based on its knowledge of artifact sizes)? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
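The per-request override proposed above can be sketched as follows. This is a hypothetical illustration in Python (Mesos itself is C++); `agent_default` stands in for a new, separate agent-level fetch timeout flag, and `task_override` for an optional per-task value a framework might supply — neither is an existing Mesos field.

```python
from datetime import timedelta

def effective_fetch_timeout(agent_default, task_override=None):
    """Pick the fetch timeout for one task launch: prefer the per-task
    value supplied by the framework, falling back to the agent default."""
    if task_override is not None:
        return task_override
    return agent_default
```

A framework that knows it is fetching a multi-gigabyte image could pass a generous override, while small tasks keep the agent default.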
[jira] [Created] (MESOS-8324) Add succeeded metric to container launch in Mesos agent
Zhitao Li created MESOS-8324: Summary: Add succeeded metric to container launch in Mesos agent Key: MESOS-8324 URL: https://issues.apache.org/jira/browse/MESOS-8324 Project: Mesos Issue Type: Improvement Reporter: Zhitao Li The only agent metric related to containerizer stability is "slave/container_launch_errors", and it does not track standalone/nested containers. I propose we add a container_launch_succeeded counter to track all container launches in the containerizer, and also make sure the `error` counter tracks standalone and nested containers. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
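The proposed accounting amounts to a success/error counter pair incremented on every launch. A minimal sketch (illustrative Python; in Mesos these would be C++ `process::metrics::Counter` objects, and the names here simply mirror the proposal):

```python
class ContainerLaunchMetrics:
    """Counters covering every launch, including standalone and nested
    containers -- not just top-level task containers."""

    def __init__(self):
        self.container_launch_succeeded = 0
        self.container_launch_errors = 0

    def record_launch(self, succeeded):
        # Exactly one of the two counters is bumped per launch attempt,
        # so the pair gives both a rate and a success ratio.
        if succeeded:
            self.container_launch_succeeded += 1
        else:
            self.container_launch_errors += 1
```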
[jira] [Created] (MESOS-8353) Duplicate task for same framework on multiple agents crashes out master after failover
Zhitao Li created MESOS-8353: Summary: Duplicate task for same framework on multiple agents crashes out master after failover Key: MESOS-8353 URL: https://issues.apache.org/jira/browse/MESOS-8353 Project: Mesos Issue Type: Bug Reporter: Zhitao Li We have seen a Mesos master crash loop after a leader failover. After further investigation, it appears that the same task ID managed to get created on multiple Mesos agents in the cluster. One possible sequence of events that can lead to this problem: 1. Task T1 was launched through master M1 on agent A1 for framework F; 2. Master M1 failed over to M2; 3. Before A1 reregistered with M2, the same T1 was launched on agent A2: M2 did not know about the previous T1 yet, so it accepted the task and sent it to A2; 4. A1 reregistered: this probably crashed M2 (because the same task cannot be added twice); 5. When M3 tried to come up after M2, it crashed as well because both A1 and A2 tried to add T1 to the framework. (I only have logs proving the last step right now.) This happened on 1.4.0 masters. Although this was probably triggered by incorrect retry logic on the framework side, I wonder whether the Mesos master should add extra protection against such issues. One possible idea is to instruct one of the agents carrying tasks with a duplicate ID to terminate the corresponding tasks, or to simply refuse to reregister such agents and instruct them to shut down. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
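The protection idea can be sketched as a duplicate check at agent (re)registration time. This is a hypothetical illustration, not existing master code: `known_tasks` stands in for the master's in-memory view of which agent owns each (framework, task) pair, and a conflicting task is flagged rather than crashing the master.

```python
def detect_duplicates(known_tasks, agent_id, reported_tasks):
    """known_tasks: dict mapping (framework_id, task_id) -> owning agent_id.
    Returns the tasks the reregistering agent reports that are already
    owned by a different agent; non-conflicting tasks are recorded."""
    duplicates = []
    for framework_id, task_id in reported_tasks:
        owner = known_tasks.get((framework_id, task_id))
        if owner is not None and owner != agent_id:
            # Same task ID already tracked on another agent: flag it so the
            # master can kill one copy (or refuse the reregistration)
            # instead of CHECK-failing.
            duplicates.append((framework_id, task_id))
        else:
            known_tasks[(framework_id, task_id)] = agent_id
    return duplicates
```

In the scenario above, A2's T1 registers first with M3; when A1 reregisters, its copy of T1 would be returned as a duplicate to act on.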
[jira] [Created] (MESOS-8358) Create agent endpoints for pruning images
Zhitao Li created MESOS-8358: Summary: Create agent endpoints for pruning images Key: MESOS-8358 URL: https://issues.apache.org/jira/browse/MESOS-8358 Project: Mesos Issue Type: Bug Reporter: Zhitao Li Assignee: Zhitao Li This is a follow-up on MESOS-4945: we agreed that we should create an HTTP endpoint on the agent to manually trigger image GC. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8358) Create agent endpoints for pruning images
[ https://issues.apache.org/jira/browse/MESOS-8358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-8358: - Issue Type: Improvement (was: Bug) > Create agent endpoints for pruning images > - > > Key: MESOS-8358 > URL: https://issues.apache.org/jira/browse/MESOS-8358 > Project: Mesos > Issue Type: Improvement >Reporter: Zhitao Li >Assignee: Zhitao Li > > This is a follow-up on MESOS-4945: we agreed that we should create an HTTP > endpoint on the agent to manually trigger image GC. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8365) Create AuthN support for prune images API
[ https://issues.apache.org/jira/browse/MESOS-8365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-8365: - Target Version/s: 1.5.0 > Create AuthN support for prune images API > - > > Key: MESOS-8365 > URL: https://issues.apache.org/jira/browse/MESOS-8365 > Project: Mesos > Issue Type: Improvement >Reporter: Zhitao Li >Assignee: Zhitao Li > > We want to make sure there is a way to configure AuthZ for the new API added > in MESOS-8360. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8365) Create AuthN support for prune images API
Zhitao Li created MESOS-8365: Summary: Create AuthN support for prune images API Key: MESOS-8365 URL: https://issues.apache.org/jira/browse/MESOS-8365 Project: Mesos Issue Type: Improvement Reporter: Zhitao Li Assignee: Zhitao Li We want to make sure there is a way to configure AuthZ for the new API added in MESOS-8360. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-4945) Garbage collect unused docker layers in the store.
[ https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316816#comment-16316816 ] Zhitao Li commented on MESOS-4945: -- That one is not necessarily part of this epic. I'll move it out. > Garbage collect unused docker layers in the store. > -- > > Key: MESOS-4945 > URL: https://issues.apache.org/jira/browse/MESOS-4945 > Project: Mesos > Issue Type: Epic >Reporter: Jie Yu >Assignee: Zhitao Li > Labels: Mesosphere > Fix For: 1.5.0 > > > Right now, we don't have any garbage collection in place for docker layers. > It's not straightforward to implement because we don't know what container is > currently using the layer. We probably need a way to track the current usage > of layers. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-6893) Track total docker image layer size in store
[ https://issues.apache.org/jira/browse/MESOS-6893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-6893: - Priority: Minor (was: Major) Description: We want to give cluster operators some insight into the total size of Docker image layers in the store so it can be used for monitoring purposes. Component/s: containerization Issue Type: Improvement (was: Task) Summary: Track total docker image layer size in store (was: Track docker layer size and access time) > Track total docker image layer size in store > > > Key: MESOS-6893 > URL: https://issues.apache.org/jira/browse/MESOS-6893 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: Zhitao Li >Assignee: Zhitao Li >Priority: Minor > > We want to give cluster operators some insight into the total size of Docker > image layers in the store so it can be used for monitoring purposes. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
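One simple way to compute such a metric is to walk the store's layers directory and sum file sizes. A minimal sketch, assuming nothing about the real store layout (illustrative Python; the actual Mesos store is C++ and may track sizes incrementally instead):

```python
import os

def total_layer_size(store_dir):
    """Sum the sizes of all regular files under the layer store directory.
    Symlinks are skipped so linked rootfs entries are not double-counted."""
    total = 0
    for root, _, files in os.walk(store_dir):
        for name in files:
            path = os.path.join(root, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total
```

The resulting number could then be exported as a gauge for monitoring.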
[jira] [Commented] (MESOS-6134) Port CFS quota support to Docker Containerizer using command executor.
[ https://issues.apache.org/jira/browse/MESOS-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15562761#comment-15562761 ] Zhitao Li commented on MESOS-6134: -- [~alexr], I'm fine with releasing this in 1.2, but the patch has been ready for a while and is pretty uncontroversial. We have been running the cherry-pick for a while without any issue. [~jieyu], do you think we can commit this in 1.1 to close it out? > Port CFS quota support to Docker Containerizer using command executor. > -- > > Key: MESOS-6134 > URL: https://issues.apache.org/jira/browse/MESOS-6134 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Reporter: Zhitao Li >Assignee: Zhitao Li > > MESOS-2154 only partially fixed the CFS quota support in the Docker > Containerizer: that fix only works for custom executors. > This tracks the fix for the command executor so we can declare this complete. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6177) Return unregistered agents recovered from registrar in `GetAgents` and/or `/state.json`
[ https://issues.apache.org/jira/browse/MESOS-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572997#comment-15572997 ] Zhitao Li commented on MESOS-6177: -- [~anandmazumdar], I'm strongly inclined to return the full {{AgentInfo}} instead of only {{AgentID}} for agents in the {{recovered}} state. My primary intention is to get hold of the {{pid}}, so the operator/subscriber can know the ip:port the agent is listening at. If we only return {{AgentID}}, the operator can do little to validate the state of the agent beyond waiting for {{--agent_reregistration_timeout}} to pass. This is also pretty easy to implement IIUIC: we can simply change {{slaves.recovered}} from a {{hashset}} to a {{hashmap}}. The {{SlaveInfo}} is already available after the Registrar recovers it. > Return unregistered agents recovered from registrar in `GetAgents` and/or > `/state.json` > --- > > Key: MESOS-6177 > URL: https://issues.apache.org/jira/browse/MESOS-6177 > Project: Mesos > Issue Type: Improvement > Components: HTTP API >Reporter: Zhitao Li >Assignee: Zhitao Li > > Use case: > This can be used for any software which talks to the Mesos master to better > understand the state of an unregistered agent after a master failover. > If this information is available, the use case in MESOS-6174 can be handled > with a simpler decision of whether the corresponding agent is removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
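The hashset-to-hashmap change suggested in the comment can be sketched as follows. This is an illustrative Python sketch (the real change would be to the C++ master's `slaves.recovered` member), with dicts standing in for `hashmap` and `SlaveInfo`:

```python
class RecoveredAgents:
    """Track agents recovered from the registrar but not yet reregistered,
    keyed by agent id and carrying their recovered info (was: a bare set
    of agent ids, which could not answer 'what is this agent's hostname?')."""

    def __init__(self):
        self.recovered = {}  # agent_id -> recovered agent info

    def recover(self, agent_id, slave_info):
        self.recovered[agent_id] = slave_info

    def mark_reregistered(self, agent_id):
        # Once the agent reregisters it leaves the 'recovered' state.
        return self.recovered.pop(agent_id, None)

    def get_agents(self):
        # Full info for unregistered-but-recovered agents, not just ids.
        return [{"id": aid, **info} for aid, info in self.recovered.items()]
```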
[jira] [Comment Edited] (MESOS-6177) Return unregistered agents recovered from registrar in `GetAgents` and/or `/state.json`
[ https://issues.apache.org/jira/browse/MESOS-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572997#comment-15572997 ] Zhitao Li edited comment on MESOS-6177 at 10/13/16 11:07 PM: - [~anandmazumdar], after some more thoughts, I'm inclined to return the full {{AgentInfo}} instead of only {{AgentID}} for agents in {{recovered}} state. My primary intention is to have a hold of {{pid}}, so the operator/subscriber can know the ip:port the agent is listening at. If we only return {{AgentID}}, the operator can do little additional babysitting steps to validate the state of the agent, except for waiting for {{--agent_reregistration_timeout}} to pass. This is also pretty easy to implement IIUIC: we can simply change the {{slaves.recovered}} from {{hashset}} to {{hashmap}}. The {{SlaveInfo}} is already available after Registrar recovers it. was (Author: zhitao): [~anandmazumdar], I'm strongly inclined to return the full {{AgentInfo}} instead of only {{AgentID}} for agents in {{recovered}} state. My primary intention is to have a hold of {{pid}}, so the operator/subscriber can know the ip:port the agent is listening at. If we only return {{AgentID}}, the operator can do little additional babysitting steps to validate the state of the agent, except for waiting for {{--agent_reregistration_timeout}} to pass. This is also pretty easy to implement IIUIC: we can simply change the {{slaves.recovered}} from {{hashset}} to {{hashmap}}. The {{SlaveInfo}} is already available after Registrar recovers it. > Return unregistered agents recovered from registrar in `GetAgents` and/or > `/state.json` > --- > > Key: MESOS-6177 > URL: https://issues.apache.org/jira/browse/MESOS-6177 > Project: Mesos > Issue Type: Improvement > Components: HTTP API >Reporter: Zhitao Li >Assignee: Zhitao Li > > Use case: > This can be used for any software which talks to Mesos master to better > understand state of an unregistered agent after a master failover. 
> If this information is available, the use case in MESOS-6174 can be handled > with a simpler decision of whether the corresponding agent is removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-6177) Return unregistered agents recovered from registrar in `GetAgents` and/or `/state.json`
[ https://issues.apache.org/jira/browse/MESOS-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572997#comment-15572997 ] Zhitao Li edited comment on MESOS-6177 at 10/14/16 1:24 AM: (edited) [~anandmazumdar], after some more thoughts, I'm inclined to return the full {{AgentInfo}} instead of only {{AgentID}} for agents in {{recovered}} state. This has the benefit to help operators to know the hostname of the agent id which is not recovered yet without calling registry again. -My primary intention is to have a hold of {{pid}}, so the operator/subscriber can know the ip:port the agent is listening at. If we only return {{AgentID}}, the operator can do little additional babysitting steps to validate the state of the agent, except for waiting for {{--agent_reregistration_timeout}} to pass.- -This is also pretty easy to implement IIUIC: we can simply change the {{slaves.recovered}} from {{hashset}} to {{hashmap}}. The {{SlaveInfo}} is already available after Registrar recovers it.- was (Author: zhitao): [~anandmazumdar], after some more thoughts, I'm inclined to return the full {{AgentInfo}} instead of only {{AgentID}} for agents in {{recovered}} state. My primary intention is to have a hold of {{pid}}, so the operator/subscriber can know the ip:port the agent is listening at. If we only return {{AgentID}}, the operator can do little additional babysitting steps to validate the state of the agent, except for waiting for {{--agent_reregistration_timeout}} to pass. This is also pretty easy to implement IIUIC: we can simply change the {{slaves.recovered}} from {{hashset}} to {{hashmap}}. The {{SlaveInfo}} is already available after Registrar recovers it. 
> Return unregistered agents recovered from registrar in `GetAgents` and/or > `/state.json` > --- > > Key: MESOS-6177 > URL: https://issues.apache.org/jira/browse/MESOS-6177 > Project: Mesos > Issue Type: Improvement > Components: HTTP API >Reporter: Zhitao Li >Assignee: Zhitao Li > > Use case: > This can be used for any software which talks to Mesos master to better > understand state of an unregistered agent after a master failover. > If this information is available, the use case in MESOS-6174 can be handled > with a simpler decision of whether the corresponding agent is removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-6177) Return unregistered agents recovered from registrar in `GetAgents` and/or `/state.json`
[ https://issues.apache.org/jira/browse/MESOS-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572997#comment-15572997 ] Zhitao Li edited comment on MESOS-6177 at 10/14/16 1:25 AM: (edited after recalling that pid is not in SlaveInfo. We should think about adding {{Address}} to {{SlaveInfo}} if possible but that has to be a different ticket) [~anandmazumdar], after some more thoughts, I'm inclined to return the full {{AgentInfo}} instead of only {{AgentID}} for agents in {{recovered}} state. This has the benefit to help operators to know the hostname of the agent id which is not recovered yet without calling registry again. -My primary intention is to have a hold of {{pid}}, so the operator/subscriber can know the ip:port the agent is listening at. If we only return {{AgentID}}, the operator can do little additional babysitting steps to validate the state of the agent, except for waiting for {{--agent_reregistration_timeout}} to pass.- -This is also pretty easy to implement IIUIC: we can simply change the {{slaves.recovered}} from {{hashset}} to {{hashmap}}. The {{SlaveInfo}} is already available after Registrar recovers it.- was (Author: zhitao): (edited) [~anandmazumdar], after some more thoughts, I'm inclined to return the full {{AgentInfo}} instead of only {{AgentID}} for agents in {{recovered}} state. This has the benefit to help operators to know the hostname of the agent id which is not recovered yet without calling registry again. -My primary intention is to have a hold of {{pid}}, so the operator/subscriber can know the ip:port the agent is listening at. If we only return {{AgentID}}, the operator can do little additional babysitting steps to validate the state of the agent, except for waiting for {{--agent_reregistration_timeout}} to pass.- -This is also pretty easy to implement IIUIC: we can simply change the {{slaves.recovered}} from {{hashset}} to {{hashmap}}. 
The {{SlaveInfo}} is already available after Registrar recovers it.- > Return unregistered agents recovered from registrar in `GetAgents` and/or > `/state.json` > --- > > Key: MESOS-6177 > URL: https://issues.apache.org/jira/browse/MESOS-6177 > Project: Mesos > Issue Type: Improvement > Components: HTTP API >Reporter: Zhitao Li >Assignee: Zhitao Li > > Use case: > This can be used for any software which talks to Mesos master to better > understand state of an unregistered agent after a master failover. > If this information is available, the use case in MESOS-6174 can be handled > with a simpler decision of whether the corresponding agent is removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4945) Garbage collect unused docker layers in the store.
[ https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li reassigned MESOS-4945: Assignee: Zhitao Li > Garbage collect unused docker layers in the store. > -- > > Key: MESOS-4945 > URL: https://issues.apache.org/jira/browse/MESOS-4945 > Project: Mesos > Issue Type: Improvement >Reporter: Jie Yu >Assignee: Zhitao Li > > Right now, we don't have any garbage collection in place for docker layers. > It's not straightforward to implement because we don't know what container is > currently using the layer. We probably need a way to track the current usage > of layers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4945) Garbage collect unused docker layers in the store.
[ https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586813#comment-15586813 ] Zhitao Li commented on MESOS-4945: -- Current plan: - Add a "cleanup" method to the store interface, which takes a {{vector}} of "images in use"; - the store can choose its own implementation of what it wants to clean up. Deleted images will be returned in a {{Future>}}; - it's the job of the Containerizer/Provisioner to actively prepare the list of "images in use" - initially this can simply be done by traversing all active containers, since the provisioner already has all the information in its memory; - The initial implementation will add a new flag indicating the upper size limit of the docker store directory, and docker::store will delete images until usage drops below that limit; - The invocation of store::cleanup can happen either on a background timer, upon provisioner::destroy, or before the pull? (I have no real preference, but calling it before the pull seems safest if we use a space-based policy?); - The initial implementation of the store will traverse all images in the store; - Further optimizations include implementing reference counting and size counting of all images in the store, and checkpointing them. We might also need some kind of LRU implementation here. > Garbage collect unused docker layers in the store. > -- > > Key: MESOS-4945 > URL: https://issues.apache.org/jira/browse/MESOS-4945 > Project: Mesos > Issue Type: Improvement >Reporter: Jie Yu >Assignee: Zhitao Li > > Right now, we don't have any garbage collection in place for docker layers. > It's not straightforward to implement because we don't know what container is > currently using the layer. We probably need a way to track the current usage > of layers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
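The space-based eviction step of this plan can be sketched as below. This is a hypothetical Python illustration (the real store is C++); the `Image` fields and the function signature are illustrative stand-ins for the checkpointed per-image metadata the plan describes.

```python
from dataclasses import dataclass

@dataclass
class Image:
    name: str
    size: int         # total layer size in bytes
    last_used: float  # when the last container using it was destroyed

def cleanup(images, images_in_use, capacity):
    """Return the names of images to delete: evict images not in use,
    least recently used first, until the total size fits under capacity."""
    total = sum(img.size for img in images)
    deleted = []
    for img in sorted(images, key=lambda i: i.last_used):
        if total <= capacity:
            break
        if img.name in images_in_use:
            continue  # never delete an image an active container still uses
        deleted.append(img.name)
        total -= img.size
    return deleted
```

This matches the "remove images with empty container ids, sorted by last time used" policy while guaranteeing in-use images survive even when the store is over capacity.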
[jira] [Created] (MESOS-6415) Create a unit test for OOM in Mesos containerizer's mem isolator
Zhitao Li created MESOS-6415: Summary: Create a unit test for OOM in Mesos containerizer's mem isolator Key: MESOS-6415 URL: https://issues.apache.org/jira/browse/MESOS-6415 Project: Mesos Issue Type: Improvement Components: testing Reporter: Zhitao Li Assignee: Zhitao Li Priority: Minor It seems we don't have any integration test exercising the case of exceeding a container's memory limit. We could add one to cgroups_isolator_tests.cpp. Good starting task for anyone interested in this area, including myself. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6415) Create an unit test for OOM in Mesos containerizer's mem isolator
[ https://issues.apache.org/jira/browse/MESOS-6415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587117#comment-15587117 ] Zhitao Li commented on MESOS-6415: -- [~jieyu] and I chatted on the containerizer slack, and I exactly want to pursue this as a separate test. Slack history: ``` It seems like we don't have any integration test practicing the case of exceeding container memory limit. @jieyu @gilbert ? Jie Yu [4:43 PM] we do have a balloon framework Zhitao Li [4:44 PM] Is it exercised through a test? Jie Yu [4:44 PM] yeah Gilbert Song [4:44 PM] yes, [4:44] through a script in a unit test Jie Yu [4:44 PM] in retrospect, we can simply use a command task [4:45] at the time balloon framework was written, command task does not exist yet Zhitao Li [4:45 PM] I'd volunteer me or someone from our team to write a smaller test, if you want to shepherd (edited) Jie Yu [4:45 PM] yup, i’d be happy to shepherd [4:45] you should add one to cgroups_isolator_tests.cpp Zhitao Li [4:46 PM] Will file an issue and claim it under my umbrella for now. Thanks Jie Yu [4:46 PM] oh [4:46] hold on [4:46] we do have MemoryPressureMesosTest [4:48] but I guess we don’t have a oom test [4:48] memory pressure is mainly for the stats [4:49] yeah, @zhitao, we should add a OOM test ``` > Create an unit test for OOM in Mesos containerizer's mem isolator > - > > Key: MESOS-6415 > URL: https://issues.apache.org/jira/browse/MESOS-6415 > Project: Mesos > Issue Type: Improvement > Components: testing >Reporter: Zhitao Li >Assignee: Zhitao Li >Priority: Minor > > It seems like we don't have any integration test practicing the case of > exceeding container memory limit. > We could add one to cgroups_isolator_tests.cpp. > Good starting task for anyone interested in this area, including myself. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-6415) Create a unit test for OOM in Mesos containerizer's mem isolator
[ https://issues.apache.org/jira/browse/MESOS-6415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587117#comment-15587117 ] Zhitao Li edited comment on MESOS-6415 at 10/19/16 12:10 AM: - [~jieyu] and I chatted on the containerizer slack, and I exactly want to pursue this as a separate test. Slack history: {quote} It seems like we don't have any integration test practicing the case of exceeding container memory limit. @jieyu @gilbert ? Jie Yu [4:43 PM] we do have a balloon framework Zhitao Li [4:44 PM] Is it exercised through a test? Jie Yu [4:44 PM] yeah Gilbert Song [4:44 PM] yes, [4:44] through a script in a unit test Jie Yu [4:44 PM] in retrospect, we can simply use a command task [4:45] at the time balloon framework was written, command task does not exist yet Zhitao Li [4:45 PM] I'd volunteer me or someone from our team to write a smaller test, if you want to shepherd (edited) Jie Yu [4:45 PM] yup, i’d be happy to shepherd [4:45] you should add one to cgroups_isolator_tests.cpp Zhitao Li [4:46 PM] Will file an issue and claim it under my umbrella for now. Thanks Jie Yu [4:46 PM] oh [4:46] hold on [4:46] we do have MemoryPressureMesosTest [4:48] but I guess we don’t have a oom test [4:48] memory pressure is mainly for the stats [4:49] yeah, @zhitao, we should add a OOM test {quote} was (Author: zhitao): [~jieyu] and I chatted on the containerizer slack, and I exactly want to pursue this as a separate test. Slack history: ``` It seems like we don't have any integration test practicing the case of exceeding container memory limit. @jieyu @gilbert ? Jie Yu [4:43 PM] we do have a balloon framework Zhitao Li [4:44 PM] Is it exercised through a test? 
Jie Yu [4:44 PM] yeah Gilbert Song [4:44 PM] yes, [4:44] through a script in a unit test Jie Yu [4:44 PM] in retrospect, we can simply use a command task [4:45] at the time balloon framework was written, command task does not exist yet Zhitao Li [4:45 PM] I'd volunteer me or someone from our team to write a smaller test, if you want to shepherd (edited) Jie Yu [4:45 PM] yup, i’d be happy to shepherd [4:45] you should add one to cgroups_isolator_tests.cpp Zhitao Li [4:46 PM] Will file an issue and claim it under my umbrella for now. Thanks Jie Yu [4:46 PM] oh [4:46] hold on [4:46] we do have MemoryPressureMesosTest [4:48] but I guess we don’t have a oom test [4:48] memory pressure is mainly for the stats [4:49] yeah, @zhitao, we should add a OOM test ``` > Create an unit test for OOM in Mesos containerizer's mem isolator > - > > Key: MESOS-6415 > URL: https://issues.apache.org/jira/browse/MESOS-6415 > Project: Mesos > Issue Type: Improvement > Components: testing >Reporter: Zhitao Li >Assignee: Zhitao Li >Priority: Minor > > It seems like we don't have any integration test practicing the case of > exceeding container memory limit. > We could add one to cgroups_isolator_tests.cpp. > Good starting task for anyone interested in this area, including myself. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4945) Garbage collect unused docker layers in the store.
[ https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586813#comment-15586813 ] Zhitao Li edited comment on MESOS-4945 at 10/19/16 8:58 AM:
Revised plan in rough steps:
* For each image, checkpoint a) container ids, b) the time the last container using it was destroyed, and c) the size of each layer;
** TODO: how to deal with migration? The idea is to pass more info through the recover() chain of containerizer -> provisioner -> store;
* Change the store interface:
** change "get(Image)" to "get(Image, ContainerID)": the added ContainerID field can be used to implement ref counting and further bookkeeping (e.g. getting local image information);
** add a "remove(Image, ContainerID)" virtual function: this is optional, in that a store which does not do ref counting can skip implementing it.
* Make sure provisioner::destroy() calls store::remove(Image, ContainerID);
* Add a command line flag for the docker store capacity limit;
* In (docker) store::get(Image, ContainerID), after a pull is done, calculate total layer sizes; if above store capacity, remove images with empty container ids (i.e. not in use), sorted by time of last use. Any unused layer is also removed, until total size drops below capacity.
Open questions:
1) In this design, we have explicit reference counting between {{Container}} and {{Image}} in the store. However, this information could be constructed on the fly from all containers in the {{Containerizer}} class. Do we consider this "double accounting" problematic, or error-prone?
2) Is calling the new {{remove(Image, ContainerID)}} from {{Provisioner::destroy()}} sufficient to make sure all bookkeeping is properly done?

was (Author: zhitao): Current plan:
- Add a "cleanup" method to the store interface, which takes a {{vector}} of "images in use"; the store can choose its own implementation of what it wants to clean up. Deleted images will be returned in a {{Future>}};
- It's the job of the Containerizer/Provisioner to actively prepare the list of "images in use". Initially this can simply be done by traversing all active containers, if the provisioner already has all the information in its memory;
- The initial implementation will add a new flag indicating the upper size limit of the docker store directory, and docker::store will delete images until it drops below that limit;
- The invocation of store::cleanup can happen in a background timer, upon provisioner::destroy, or before the pull (I have no strong preference, but calling it before the pull seems safest if we use a space-based policy);
- The initial implementation of the store will traverse all images in the store;
- Further optimizations include implementing reference counting and size accounting for all images in the store, and checkpointing them. We might also need some kind of LRU implementation here.
> Garbage collect unused docker layers in the store.
> Key: MESOS-4945
> URL: https://issues.apache.org/jira/browse/MESOS-4945
> Project: Mesos
> Issue Type: Improvement
> Reporter: Jie Yu
> Assignee: Zhitao Li
>
> Right now, we don't have any garbage collection in place for docker layers.
> It's not straightforward to implement because we don't know what container is currently using the layer. We probably need a way to track the current usage of layers.
[jira] [Comment Edited] (MESOS-4945) Garbage collect unused docker layers in the store.
[ https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586813#comment-15586813 ] Zhitao Li edited comment on MESOS-4945 at 10/19/16 9:03 AM:
Revised plan in rough steps:
* For each image, checkpoint a) container ids, b) the time the last container using it was destroyed, and c) the size of each layer;
** TODO: how to deal with migration? The idea is to pass more info through the recover() chain of containerizer -> provisioner -> store;
* Change the store interface:
** change "get(Image)" to "get(Image, ContainerID)": the added ContainerID field can be used to implement ref counting and further bookkeeping (e.g. getting local image information);
** add a "remove(Image, ContainerID)" virtual function: this is optional, in that a store which does not do ref counting can skip implementing it.
* Make sure provisioner::destroy() calls store::remove(Image, ContainerID);
* Add a command line flag for the docker store capacity limit (in bytes);
* In (docker) store::get(Image, ContainerID), after a pull is done, calculate total layer sizes; if above store capacity, remove unused images (determined by empty container ids), sorted by time of last use. Any layer not shared by the remaining images is also removed, until total size drops below capacity.
Open questions:
1) In this design, we have explicit reference counting between {{Container}} and {{Image}} in the store. However, this information could be constructed on the fly from all containers in the {{Containerizer}} class. Do we consider this "double accounting" problematic, or error-prone?
2) Is calling the new {{remove(Image, ContainerID)}} from {{Provisioner::destroy()}} sufficient to make sure all bookkeeping is properly done?
> Garbage collect unused docker layers in the store.
> Key: MESOS-4945
[jira] [Comment Edited] (MESOS-4945) Garbage collect unused docker layers in the store.
[ https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586813#comment-15586813 ] Zhitao Li edited comment on MESOS-4945 at 10/19/16 4:39 PM:
Revised plan in rough steps:
* For each image, checkpoint a) container ids, b) the time the last container using it was destroyed, and c) the size of each layer;
** TODO: how to deal with migration? The idea is to pass more info through the recover() chain of containerizer -> provisioner -> store;
* Change the store interface:
** change "get(Image)" to "get(Image, ContainerID)": the added ContainerID field can be used to implement ref counting and further bookkeeping (e.g. getting local image information);
** add a "remove(Image, ContainerID)" virtual function: this is optional, in that a store which does not do ref counting can have an empty implementation.
* Make sure provisioner::destroy() calls store::remove(Image, ContainerID);
* Add a command line flag for the docker store capacity limit (in bytes);
* In (docker) store::get(Image, ContainerID), after a pull is done, calculate total layer sizes; if above store capacity, remove unused images (determined by empty container ids), sorted by time of last use. Any layer not shared by the remaining images is also removed, until total size drops below capacity.
Open questions:
1) In this design, we have explicit reference counting between {{Container}} and {{Image}} in the store. However, this information could be constructed on the fly from all containers in the {{Containerizer}} class. Do we consider this "double accounting" problematic, or error-prone?
2) Is calling the new {{remove(Image, ContainerID)}} from {{Provisioner::destroy()}} sufficient to make sure all bookkeeping is properly done?
> Garbage collect unused docker layers in the store.
> Key: MESOS-4945
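The eviction step in the plan above ("remove unused images, sorted by time of last use, until total size drops below capacity") can be sketched as a small self-contained model. `ImageEntry` and `evict` are hypothetical names for illustration, not the actual store interface:

```cpp
#include <algorithm>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical checkpointed state per image: the containers currently
// referencing it, when its last container was destroyed, and its size.
struct ImageEntry
{
  std::set<std::string> containerIds; // Ref counting via container ids.
  long lastUsed = 0;                  // Time the last container was destroyed.
  size_t size = 0;                    // Total layer size in bytes.
};

// Evict unused images (empty containerIds), oldest `lastUsed` first,
// until total size drops to `capacity`. Returns the evicted image names.
std::vector<std::string> evict(
    std::map<std::string, ImageEntry>& store, size_t capacity)
{
  size_t total = 0;
  for (const auto& entry : store) {
    total += entry.second.size;
  }

  // Candidates: images no container references anymore.
  std::vector<std::string> unused;
  for (const auto& entry : store) {
    if (entry.second.containerIds.empty()) {
      unused.push_back(entry.first);
    }
  }

  // Least recently used first.
  std::sort(unused.begin(), unused.end(),
      [&](const std::string& a, const std::string& b) {
        return store[a].lastUsed < store[b].lastUsed;
      });

  std::vector<std::string> evicted;
  for (const std::string& name : unused) {
    if (total <= capacity) {
      break;
    }
    total -= store[name].size;
    store.erase(name);
    evicted.push_back(name);
  }
  return evicted;
}
```

Images still referenced by a container are never candidates, which is why `Provisioner::destroy()` must reliably call `remove(Image, ContainerID)` to clear the reference.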
[jira] [Created] (MESOS-6429) Create metrics for docker store
Zhitao Li created MESOS-6429:
Summary: Create metrics for docker store
Key: MESOS-6429
URL: https://issues.apache.org/jira/browse/MESOS-6429
Project: Mesos
Issue Type: Improvement
Reporter: Zhitao Li
Ideas for metrics we have right now (in order of importance):
- size of store (gauge)
- amount of data pulled (counter)
- number of layers cached (gauge)
- number of pulls (counter: rate can be derived externally)
Suggestions on what other metrics we should add are welcome.
[jira] [Updated] (MESOS-4945) Garbage collect unused docker layers in the store.
[ https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-4945:
- Shepherd: Jie Yu
> Garbage collect unused docker layers in the store.
> Key: MESOS-4945
[jira] [Updated] (MESOS-6429) Create metrics for docker store
[ https://issues.apache.org/jira/browse/MESOS-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-6429:
- Shepherd: Jie Yu
Assignee: Zhitao Li
> Create metrics for docker store
> Key: MESOS-6429
[jira] [Created] (MESOS-6434) Publicize the test infrastructure for modules
Zhitao Li created MESOS-6434:
Summary: Publicize the test infrastructure for modules
Key: MESOS-6434
URL: https://issues.apache.org/jira/browse/MESOS-6434
Project: Mesos
Issue Type: Task
Components: testing
Reporter: Zhitao Li
As we discussed in today's meeting, the goal is to allow module authors to use Mesos's internal testing infrastructure to test their own modules and replace hacks like https://github.com/dcos/dcos-mesos-modules/blob/bb6f6b22138ae38c9c8305e571deca2e4df7f3b3/configure.ac#L342-L359
Some action items I recall:
- clean up existing headers and make them suitable for installation;
- determine whether we will allow unversioned protobuf or only v1 protobuf;
- create a library like libmesos_tests which module authors can link against.
[~kaysoky] [~jvanremoortere], please help fill in more details and triage.
[jira] [Created] (MESOS-6451) Add timer and percentile for docker pull latency distribution.
Zhitao Li created MESOS-6451:
Summary: Add timer and percentile for docker pull latency distribution.
Key: MESOS-6451
URL: https://issues.apache.org/jira/browse/MESOS-6451
Project: Mesos
Issue Type: Improvement
Components: containerization
Reporter: Zhitao Li
Assignee: Zhitao Li
The proposal here is to add a timer to both the Mesos containerizer and the Docker containerizer to monitor the latency distribution of pulling images.
This can be used by operators who run either containerizer in production, and during a migration phase to understand any performance variation.
I plan to use a one-hour look-back window for this timer, unless there are other concerns.
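The "timer with a one-hour look-back window" idea can be sketched as a self-contained sliding-window timer: record each pull's duration with a timestamp, drop samples older than the window, and compute percentiles over what remains. `WindowTimer` is a hypothetical illustration, not the libprocess `Timer` metric, and it assumes a nearest-rank percentile and at least one recorded sample:

```cpp
#include <algorithm>
#include <chrono>
#include <deque>
#include <vector>

// Hypothetical sliding-window timer for pull latencies.
class WindowTimer
{
public:
  explicit WindowTimer(std::chrono::seconds window) : window_(window) {}

  // Record one pull's duration; `now` is injectable for testing.
  void record(double seconds,
              std::chrono::steady_clock::time_point now =
                  std::chrono::steady_clock::now())
  {
    samples_.push_back({now, seconds});

    // Expire samples that fell out of the look-back window.
    while (!samples_.empty() && now - samples_.front().when > window_) {
      samples_.pop_front();
    }
  }

  // Nearest-rank percentile, p in [0, 100]. Requires >= 1 sample.
  double percentile(double p) const
  {
    std::vector<double> values;
    for (const auto& sample : samples_) {
      values.push_back(sample.duration);
    }
    std::sort(values.begin(), values.end());

    size_t rank = static_cast<size_t>((p / 100.0) * (values.size() - 1));
    return values[rank];
  }

private:
  struct Sample
  {
    std::chrono::steady_clock::time_point when;
    double duration;
  };

  std::chrono::seconds window_;
  std::deque<Sample> samples_;
};
```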
[jira] [Commented] (MESOS-6451) Add timer and percentile for docker pull latency distribution.
[ https://issues.apache.org/jira/browse/MESOS-6451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15596439#comment-15596439 ] Zhitao Li commented on MESOS-6451:
-- https://reviews.apache.org/r/53105/ for DockerContainerizer.
> Add timer and percentile for docker pull latency distribution.
> Key: MESOS-6451
[jira] [Created] (MESOS-6495) Create metrics for HTTP API endpoint
Zhitao Li created MESOS-6495:
Summary: Create metrics for HTTP API endpoint
Key: MESOS-6495
URL: https://issues.apache.org/jira/browse/MESOS-6495
Project: Mesos
Issue Type: Improvement
Reporter: Zhitao Li
We should have metrics on the various response codes for the (scheduler) HTTP API (2xx, 4xx, etc.).
[~anandmazumdar] suggested that ideally the solution could be easily extended to cover other endpoints if we directly enhance libprocess, so we can also cover the other APIs (Master/Agent).
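The response-code breakdown the ticket asks for can be sketched as a small self-contained counter keyed by response-code class ("2xx", "4xx", ...). `ResponseCodeMetrics` is a hypothetical name for illustration, not a libprocess API:

```cpp
#include <map>
#include <string>

// Hypothetical counters keyed by HTTP response-code class.
class ResponseCodeMetrics
{
public:
  // Record one response, e.g. record(200) increments "2xx".
  void record(int code)
  {
    counters_[className(code)]++;
  }

  long count(const std::string& cls) const
  {
    auto it = counters_.find(cls);
    return it == counters_.end() ? 0 : it->second;
  }

private:
  // 200 -> "2xx", 404 -> "4xx", 503 -> "5xx", ...
  static std::string className(int code)
  {
    return std::to_string(code / 100) + "xx";
  }

  std::map<std::string, long> counters_;
};
```

In practice each class counter would likely be a separate `process::metrics::Counter` registered per endpoint, so the breakdown shows up in the standard `/metrics/snapshot` output.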
[jira] [Updated] (MESOS-6495) Create metrics for HTTP API endpoint response codes.
[ https://issues.apache.org/jira/browse/MESOS-6495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-6495:
- Summary: Create metrics for HTTP API endpoint response codes. (was: Create metrics for HTTP API endpoint)
> Create metrics for HTTP API endpoint response codes.
> Key: MESOS-6495
[jira] [Created] (MESOS-6499) Add metric to track active subscribers in operator API
Zhitao Li created MESOS-6499:
Summary: Add metric to track active subscribers in operator API
Key: MESOS-6499
URL: https://issues.apache.org/jira/browse/MESOS-6499
Project: Mesos
Issue Type: Improvement
Components: HTTP API
Reporter: Zhitao Li
Assignee: Zhitao Li
[jira] [Commented] (MESOS-3574) Support replacing ZooKeeper with replicated log
[ https://issues.apache.org/jira/browse/MESOS-3574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616044#comment-15616044 ] Zhitao Li commented on MESOS-3574:
-- How will frameworks and agents detect where the leading master is when using the replicated log? Are clients expected to hard-code a list of masters' ip:port pairs and rely on the redirect message from the master?
> Support replacing ZooKeeper with replicated log
> Key: MESOS-3574
> URL: https://issues.apache.org/jira/browse/MESOS-3574
> Project: Mesos
> Issue Type: Improvement
> Components: leader election, replicated log
> Reporter: Neil Conway
> Labels: mesosphere
>
> It would be useful to support using the replicated log without also requiring ZooKeeper to be running. This would simplify the process of configuring/operating a high-availability configuration of Mesos.
> At least three things would need to be done:
> 1. Abstract away the stuff we use Zk for into an interface that can be implemented (e.g., by etcd, consul, rep-log, or Zk). This might be done already as part of [MESOS-1806]
> 2. Enhance the replicated log to be able to do its own leader election + failure detection (to decide when the current master is down).
> 3. Validate replicated log performance to ensure it is adequate (per Joris, likely needs some significant work)
[jira] [Commented] (MESOS-6457) Tasks shouldn't transition from TASK_KILLING to TASK_RUNNING.
[ https://issues.apache.org/jira/browse/MESOS-6457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15622713#comment-15622713 ] Zhitao Li commented on MESOS-6457:
-- Is this behavior only possible when a framework opts in to Mesos's own health checks (i.e., custom executors which do not use Mesos health checks should not be affected)?
> Tasks shouldn't transition from TASK_KILLING to TASK_RUNNING.
> Key: MESOS-6457
> URL: https://issues.apache.org/jira/browse/MESOS-6457
> Project: Mesos
> Issue Type: Bug
> Reporter: Gastón Kleiman
> Assignee: Gastón Kleiman
> Priority: Blocker
>
> A task can currently transition from {{TASK_KILLING}} to {{TASK_RUNNING}}, if for example it starts/stops passing a health check once it got into the {{TASK_KILLING}} state.
> I think that this behaviour is counterintuitive. It also makes the life of framework/tools developers harder, since they have to keep track of the complete task status history in order to know if a task is being killed.
[jira] [Created] (MESOS-6554) Create event stream capability in agent API
Zhitao Li created MESOS-6554:
Summary: Create event stream capability in agent API
Key: MESOS-6554
URL: https://issues.apache.org/jira/browse/MESOS-6554
Project: Mesos
Issue Type: Wish
Components: HTTP API
Reporter: Zhitao Li
Similar to the event stream API in the master, I hope we can have the same capability in the agent API. Many container-related integration projects use APIs like [docker events|https://docs.docker.com/engine/reference/api/docker_remote_api_v1.24/#/monitor-dockers-events], and people need an equivalent solution if they want to use the Mesos containerizer to run docker containers.
[jira] [Assigned] (MESOS-6162) Add support for cgroups blkio subsystem
[ https://issues.apache.org/jira/browse/MESOS-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li reassigned MESOS-6162:
Assignee: Zhitao Li
> Add support for cgroups blkio subsystem
> Key: MESOS-6162
> URL: https://issues.apache.org/jira/browse/MESOS-6162
> Project: Mesos
> Issue Type: Task
> Reporter: haosdent
> Assignee: Zhitao Li
>
> Noted that the cgroups blkio subsystem may have performance issues, refer to https://github.com/opencontainers/runc/issues/861
[jira] [Issue Comment Deleted] (MESOS-4945) Garbage collect unused docker layers in the store.
[ https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-4945:
- Comment: was deleted (was: the "Revised plan in rough steps" comment quoted in full in the earlier edit notifications above.)
> Garbage collect unused docker layers in the store.
> Key: MESOS-4945
> Issue Type: Epic
[jira] [Commented] (MESOS-4945) Garbage collect unused docker layers in the store.
[ https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15714038#comment-15714038 ] Zhitao Li commented on MESOS-4945: -- [~gilbert] [~jieyu], I've put up a short design doc for this in https://docs.google.com/document/d/1TSn7HOFLWpF3TLRVe4XyLpv6B__A1tk-tU16B1ZbsCI/edit#. Please take a look and let me know if you see issues. If it looks good, I'll add more issues to this epic. > Garbage collect unused docker layers in the store. > -- > > Key: MESOS-4945 > URL: https://issues.apache.org/jira/browse/MESOS-4945 > Project: Mesos > Issue Type: Epic >Reporter: Jie Yu >Assignee: Zhitao Li > > Right now, we don't have any garbage collection in place for docker layers. > It's not straightforward to implement because we don't know what container is > currently using the layer. We probably need a way to track the current usage > of layers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-6495) Create metrics for HTTP API endpoint response codes.
[ https://issues.apache.org/jira/browse/MESOS-6495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li reassigned MESOS-6495:
Assignee: Zhitao Li
> Create metrics for HTTP API endpoint response codes.
> Key: MESOS-6495
[jira] [Updated] (MESOS-6495) Create metrics for HTTP API endpoint response codes.
[ https://issues.apache.org/jira/browse/MESOS-6495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-6495:
- Shepherd: Anand Mazumdar
> Create metrics for HTTP API endpoint response codes.
> Key: MESOS-6495
[jira] [Commented] (MESOS-6082) Add scheduler Call and Event based metrics to the master.
[ https://issues.apache.org/jira/browse/MESOS-6082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15715838#comment-15715838 ] Zhitao Li commented on MESOS-6082:
-- Hi [~a10gupta], we are hoping to get this done soon, ideally making the Mesos 1.2 release (scheduled sometime in Jan 2017). Are you actively working on this? If not, maybe unclaim it and I'll take a pass? Thanks!
> Add scheduler Call and Event based metrics to the master.
> Key: MESOS-6082
> URL: https://issues.apache.org/jira/browse/MESOS-6082
> Project: Mesos
> Issue Type: Improvement
> Components: master
> Reporter: Benjamin Mahler
> Assignee: Abhishek Dasgupta
>
> Currently, the master only has metrics for the old-style messages and these are unfortunately re-used for calls:
> {code}
> // Messages from schedulers.
> process::metrics::Counter messages_register_framework;
> process::metrics::Counter messages_reregister_framework;
> process::metrics::Counter messages_unregister_framework;
> process::metrics::Counter messages_deactivate_framework;
> process::metrics::Counter messages_kill_task;
> process::metrics::Counter messages_status_update_acknowledgement;
> process::metrics::Counter messages_resource_request;
> process::metrics::Counter messages_launch_tasks;
> process::metrics::Counter messages_decline_offers;
> process::metrics::Counter messages_revive_offers;
> process::metrics::Counter messages_suppress_offers;
> process::metrics::Counter messages_reconcile_tasks;
> process::metrics::Counter messages_framework_to_executor;
> {code}
> Now that we've introduced the Call/Event based API, we should have metrics that reflect this. For example:
> {code}
> {
>   scheduler/calls: 100,
>   scheduler/calls/decline: 90,
>   scheduler/calls/accept: 10,
>   scheduler/calls/accept/operations/create: 1,
>   scheduler/calls/accept/operations/destroy: 0,
>   scheduler/calls/accept/operations/launch: 4,
>   scheduler/calls/accept/operations/launch_group: 2,
>   scheduler/calls/accept/operations/reserve: 1,
>   scheduler/calls/accept/operations/unreserve: 0,
>   scheduler/calls/kill: 0,
>   // etc
> }
> {code}
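The hierarchical counter layout in the example above (an aggregate `scheduler/calls` plus one counter per call type) can be sketched as a self-contained model. `SchedulerCallMetrics` is a hypothetical illustration; the real implementation would register individual `process::metrics::Counter` objects under these names:

```cpp
#include <map>
#include <string>

// Sketch of Call-based counters keyed by hierarchical metric names.
class SchedulerCallMetrics
{
public:
  // On each scheduler Call, bump the aggregate and the per-type counter.
  void onCall(const std::string& type)
  {
    counters_["scheduler/calls"]++;
    counters_["scheduler/calls/" + type]++;
  }

  long value(const std::string& key) const
  {
    auto it = counters_.find(key);
    return it == counters_.end() ? 0 : it->second;
  }

private:
  std::map<std::string, long> counters_;
};
```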
[jira] [Commented] (MESOS-1280) Add replace task primitive
[ https://issues.apache.org/jira/browse/MESOS-1280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15755230#comment-15755230 ] Zhitao Li commented on MESOS-1280:
-- Hi, is there any common interest in pursuing this in the next 1-2 Mesos release cycles? Our organization is quite interested in adding this capability for a couple of reasons, and would be happy if some committer is willing to shepherd us. Thanks!
> Add replace task primitive
> Key: MESOS-1280
> URL: https://issues.apache.org/jira/browse/MESOS-1280
> Project: Mesos
> Issue Type: Bug
> Components: agent, c++ api, master
> Reporter: Niklas Quarfot Nielsen
> Labels: mesosphere
>
> Also along the lines of MESOS-938, replaceTask would be one of a couple of primitives needed to support various task replacement and scaling scenarios.
> This replaceTask() version is significantly simpler than the first proposed one; its only responsibility is to run a new task info on a running task's resources.
> The running task will be killed as usual, but the newly freed resources will never be announced and the new task will run on them instead.
[jira] [Created] (MESOS-6808) Refactor Docker::run to only take docker cli parameters
Zhitao Li created MESOS-6808:
Summary: Refactor Docker::run to only take docker cli parameters
Key: MESOS-6808
URL: https://issues.apache.org/jira/browse/MESOS-6808
Project: Mesos
Issue Type: Task
Components: docker
Reporter: Zhitao Li
Assignee: Zhitao Li
Priority: Minor
As we discussed, {{Docker::run}} in src/docker/docker.hpp should only understand docker cli options. The logic that creates these options should be refactored into a separate helper function. This will also let us work around GMock's limit of at most 10 arguments on mocked methods.
[jira] [Created] (MESOS-6831) Add metrics for `slave` libprocess' event queue
Zhitao Li created MESOS-6831:
Summary: Add metrics for `slave` libprocess' event queue
Key: MESOS-6831
URL: https://issues.apache.org/jira/browse/MESOS-6831
Project: Mesos
Issue Type: Improvement
Components: agent
Reporter: Zhitao Li
We have event queue metrics for the master and the allocator in http://mesos.apache.org/documentation/latest/monitoring/, but we don't have the event queue length of the agent's most important libprocess actor, `slave`. I propose we add similar metrics for this actor. This would at least be useful for debugging whether a Mesos agent is overloaded.