[jira] [Updated] (MESOS-4052) Simple hook implementation proxying out to another daemon process

2015-12-02 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-4052:
-
Description: 
Right now, if a Mesos user needs hooks like slavePreLaunchDockerHook, they need 
to maintain the compiling, building, and packaging of a dynamically linked C++ 
library in house.

Designs like [Docker's Volume 
plugin|https://docs.docker.com/engine/extend/plugins_volume/] simply require the 
user to implement a predefined REST API in any language and listen on a domain 
socket. This would be more flexible for companies that do not use C++ as their 
primary language.

This ticket explores whether Mesos could provide a default module that 1) 
defines such an API and 2) proxies out to an external agent for any heavy 
lifting.
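A rough sketch of the proxying side, just to make the idea concrete (the socket 
path and the single-request wire format are purely illustrative, not an existing 
Mesos API; the module would serialize the hook's inputs, send them to the 
external daemon, and act on the response):

{code:cpp}
// Illustrative only: talk to a hypothetical external hook daemon listening on
// a Unix domain socket. A real module would use a proper HTTP/REST client and
// handle partial writes, timeouts, and errors more carefully.
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#include <cstring>
#include <stdexcept>
#include <string>

std::string callExternalHook(
    const std::string& socketPath,   // e.g. "/var/run/mesos-hook.sock" (made up)
    const std::string& requestJson)  // Serialized hook inputs.
{
  int fd = ::socket(AF_UNIX, SOCK_STREAM, 0);
  if (fd < 0) {
    throw std::runtime_error("socket() failed");
  }

  sockaddr_un addr{};
  addr.sun_family = AF_UNIX;
  std::strncpy(addr.sun_path, socketPath.c_str(), sizeof(addr.sun_path) - 1);

  if (::connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
    ::close(fd);
    throw std::runtime_error("connect() failed for " + socketPath);
  }

  // Send the request, then half-close so the daemon sees EOF and replies.
  if (::write(fd, requestJson.data(), requestJson.size()) < 0) {
    ::close(fd);
    throw std::runtime_error("write() failed");
  }
  ::shutdown(fd, SHUT_WR);

  std::string response;
  char buffer[4096];
  ssize_t n;
  while ((n = ::read(fd, buffer, sizeof(buffer))) > 0) {
    response.append(buffer, static_cast<size_t>(n));
  }

  ::close(fd);
  return response;
}
{code}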

Please let me know whether you think this seems like a reasonable 
feature/requirement.

I'm more than happy to work on this rather than maintain this hook in house 
over the long term.

  was:
Right now, if Mesos user needs hooks like slavePreLaunchDockerHook, they would 
need to maintain the compiling, building and packaging of a dynamically linked 
library in c++ in house.

Designs like [Docker's Volume 
plugin|https://docs.docker.com/engine/extend/plugins_volume/] simply requires 
user to implement a predefined REST API in any language and listen at a domain 
socket. This would be more flexible for companies that does not use c++ as 
primary language.

This ticket is exploring the possibility of whether Mesos could provide a 
default module that 1) defines such API and 2) proxies out to the external 
agent for any heavy lifting.

I'm more than happy to work on this than maintain this hook in house in the 
longer term.


> Simple hook implementation proxying out to another daemon process
> -
>
> Key: MESOS-4052
> URL: https://issues.apache.org/jira/browse/MESOS-4052
> Project: Mesos
>  Issue Type: Wish
>  Components: modules
>Reporter: Zhitao Li
>Priority: Minor
>
> Right now, if a Mesos user needs hooks like slavePreLaunchDockerHook, they 
> need to maintain the compiling, building, and packaging of a dynamically 
> linked C++ library in house.
> Designs like [Docker's Volume 
> plugin|https://docs.docker.com/engine/extend/plugins_volume/] simply require 
> the user to implement a predefined REST API in any language and listen on a 
> domain socket. This would be more flexible for companies that do not use C++ 
> as their primary language.
> This ticket explores whether Mesos could provide a default module that 1) 
> defines such an API and 2) proxies out to an external agent for any heavy 
> lifting.
> Please let me know whether you think this seems like a reasonable 
> feature/requirement.
> I'm more than happy to work on this rather than maintain this hook in house 
> over the long term.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4052) Simple hook implementation proxying out to another daemon process

2015-12-02 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-4052:


 Summary: Simple hook implementation proxying out to another daemon 
process
 Key: MESOS-4052
 URL: https://issues.apache.org/jira/browse/MESOS-4052
 Project: Mesos
  Issue Type: Wish
  Components: modules
Reporter: Zhitao Li
Priority: Minor


Right now, if a Mesos user needs hooks like slavePreLaunchDockerHook, they need 
to maintain the compiling, building, and packaging of a dynamically linked C++ 
library in house.

Designs like [Docker's Volume 
plugin|https://docs.docker.com/engine/extend/plugins_volume/] simply require the 
user to implement a predefined REST API in any language and listen on a domain 
socket. This would be more flexible for companies that do not use C++ as their 
primary language.

This ticket explores whether Mesos could provide a default module that 1) 
defines such an API and 2) proxies out to an external agent for any heavy 
lifting.

I'm more than happy to work on this rather than maintain this hook in house 
over the long term.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-313) Report executor terminations to framework schedulers.

2015-12-14 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-313:

Summary: Report executor terminations to framework schedulers.  (was: 
report executor deaths to framework schedulers)

> Report executor terminations to framework schedulers.
> -
>
> Key: MESOS-313
> URL: https://issues.apache.org/jira/browse/MESOS-313
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Charles Reiss
>Assignee: Zhitao Li
>  Labels: mesosphere, newbie
>
> The Scheduler interface has a callback for executorLost, but currently it is 
> never called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3413) Docker containerizer does not symlink persistent volumes into sandbox

2015-12-24 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071374#comment-15071374
 ] 

Zhitao Li commented on MESOS-3413:
--

It seems that this task is marked as "Won't fix". I wonder whether it's possible 
to come up with a short-term fix for the existing Docker containerizer so users 
aren't blocked until the new Unified Containerizer is ready.

There seem to be two possible "quick" fixes:
- symlink the persistent volume into the sandbox;
- directly mount the persistent volume.

The getPersistentVolumePath() helper in src/slave/paths.cpp is already available 
for this purpose.

Can someone comment on this possibility?

> Docker containerizer does not symlink persistent volumes into sandbox
> -
>
> Key: MESOS-3413
> URL: https://issues.apache.org/jira/browse/MESOS-3413
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker, slave
>Affects Versions: 0.23.0
>Reporter: Max Neunhöffer
>Assignee: haosdent
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> For the ArangoDB framework I am trying to use the persistent primitives. 
> nearly all is working, but I am missing a crucial piece at the end: I have 
> successfully created a persistent disk resource and have set the persistence 
> and volume information in the DiskInfo message. However, I do not see any way 
> to find out what directory on the host the mesos slave has reserved for us. I 
> know it is ${MESOS_SLAVE_WORKDIR}/volumes/roles//_ but we 
> have no way to query this information anywhere. The docker containerizer does 
> not automatically mount this directory into our docker container, or symlinks 
> it into our sandbox. Therefore, I have essentially no access to it. Note that 
> the mesos containerizer (which I cannot use for other reasons) seems to 
> create a symlink in the sandbox to the actual path for the persistent volume. 
> With that, I could mount the volume into our docker container and all would 
> be well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-3413) Docker containerizer does not symlink persistent volumes into sandbox

2015-12-29 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071374#comment-15071374
 ] 

Zhitao Li edited comment on MESOS-3413 at 12/29/15 6:25 PM:


It seems that this task is marked as "Won't fix". I wonder whether it's possible 
to come up with a short-term fix for the existing Docker containerizer so users 
aren't blocked until the new Unified Containerizer is ready.

There seem to be two possible ways to fix this:
- add a feedback response from the slave to the master/allocator so that 
resource offers containing this persistent volume carry a non-empty hostPath;
- at launch time, detect that a persistent volume is included in the 
ContainerInfo, and automatically mount the resolved host path (see the sketch 
below).

The getPersistentVolumePath() helper in src/slave/paths.cpp is already available 
for this purpose.

Can someone comment on this possibility?
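To make the second option concrete, here is a tiny self-contained sketch (not 
Mesos code; the paths are illustrative) of the extra {{-v}} argument the Docker 
containerizer would need to add once the host path of the persistent volume is 
resolved:

{code:cpp}
#include <iostream>
#include <string>

// Illustrative only: build the extra `-v` docker argument from the host path
// the agent reserved for a persistent volume and the container path requested
// in DiskInfo.
std::string dockerVolumeArgument(
    const std::string& hostPath,
    const std::string& containerPath)
{
  return "-v " + hostPath + ":" + containerPath + ":rw";
}

int main()
{
  // Made-up paths; the real host path would come from the agent's persistent
  // volume layout under its work directory (see getPersistentVolumePath()).
  std::cout << "docker run ... "
            << dockerVolumeArgument(
                   "/var/lib/mesos/volumes/roles/test/id-1234", "/data")
            << " <image>" << std::endl;
  return 0;
}
{code}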


was (Author: zhitao):
It seem that this task is marked as "Won't fix". I wonder whether it's possible 
to come up with some short fix for the existing docker containerizer so users 
don't need to blocked until new Unified Containerizer is ready.

There seem to be two possible "quick" fixes:
- symlink the persistent volume into sandbox;
- directly mount the persistent volume.

The getPersistentVolumePath() in src/slave/paths.cpp is already available for 
this purpose.

Can someone comment on this possibility?

> Docker containerizer does not symlink persistent volumes into sandbox
> -
>
> Key: MESOS-3413
> URL: https://issues.apache.org/jira/browse/MESOS-3413
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker, slave
>Affects Versions: 0.23.0
>Reporter: Max Neunhöffer
>Assignee: haosdent
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> For the ArangoDB framework I am trying to use the persistent primitives. 
> nearly all is working, but I am missing a crucial piece at the end: I have 
> successfully created a persistent disk resource and have set the persistence 
> and volume information in the DiskInfo message. However, I do not see any way 
> to find out what directory on the host the mesos slave has reserved for us. I 
> know it is ${MESOS_SLAVE_WORKDIR}/volumes/roles//_ but we 
> have no way to query this information anywhere. The docker containerizer does 
> not automatically mount this directory into our docker container, or symlinks 
> it into our sandbox. Therefore, I have essentially no access to it. Note that 
> the mesos containerizer (which I cannot use for other reasons) seems to 
> create a symlink in the sandbox to the actual path for the persistent volume. 
> With that, I could mount the volume into our docker container and all would 
> be well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3413) Docker containerizer does not symlink persistent volumes into sandbox

2015-12-29 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074220#comment-15074220
 ] 

Zhitao Li commented on MESOS-3413:
--

[~jieyu], thanks for the reply. My comment about symlinks was indeed incorrect, 
and I have updated my previous comment. I think returning the host path in the 
resource offer could still work though, as long as the Mesos slave does not 
change the real location of the created volume.

As for not being able to update volumes for a running container, I acknowledge 
that's a limitation, but I don't really understand why it's a show stopper. 
Making it clear to users that, for the DockerContainerizer, all persistent 
volumes must be created and mounted before the executor/container is created 
still sounds reasonable to me, and it allows us to use the current 
DockerContainerizer until the new Unified Containerizer is available and covers 
the other features we may need from the Docker engine.

One of the reasons I really want this is that our in-house database system 
(which we are looking to run on Mesos) requires running multiple mysqld 
instances on the same machine, and that team has already spent quite some time 
dockerizing these instances for easy configuration and isolation.

Thanks for your time again. I'm really looking forward to using the persistent 
volume primitives!

> Docker containerizer does not symlink persistent volumes into sandbox
> -
>
> Key: MESOS-3413
> URL: https://issues.apache.org/jira/browse/MESOS-3413
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker, slave
>Affects Versions: 0.23.0
>Reporter: Max Neunhöffer
>Assignee: haosdent
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> For the ArangoDB framework I am trying to use the persistent primitives. 
> nearly all is working, but I am missing a crucial piece at the end: I have 
> successfully created a persistent disk resource and have set the persistence 
> and volume information in the DiskInfo message. However, I do not see any way 
> to find out what directory on the host the mesos slave has reserved for us. I 
> know it is ${MESOS_SLAVE_WORKDIR}/volumes/roles//_ but we 
> have no way to query this information anywhere. The docker containerizer does 
> not automatically mount this directory into our docker container, or symlinks 
> it into our sandbox. Therefore, I have essentially no access to it. Note that 
> the mesos containerizer (which I cannot use for other reasons) seems to 
> create a symlink in the sandbox to the actual path for the persistent volume. 
> With that, I could mount the volume into our docker container and all would 
> be well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3413) Docker containerizer does not symlink persistent volumes into sandbox

2015-12-29 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074278#comment-15074278
 ] 

Zhitao Li commented on MESOS-3413:
--

Thanks, [~haosd...@gmail.com]! Do you have any objection to reopening this task? 
I can try to get this done if you don't have the cycles for it.

> Docker containerizer does not symlink persistent volumes into sandbox
> -
>
> Key: MESOS-3413
> URL: https://issues.apache.org/jira/browse/MESOS-3413
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker, slave
>Affects Versions: 0.23.0
>Reporter: Max Neunhöffer
>Assignee: haosdent
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> For the ArangoDB framework I am trying to use the persistent primitives. 
> nearly all is working, but I am missing a crucial piece at the end: I have 
> successfully created a persistent disk resource and have set the persistence 
> and volume information in the DiskInfo message. However, I do not see any way 
> to find out what directory on the host the mesos slave has reserved for us. I 
> know it is ${MESOS_SLAVE_WORKDIR}/volumes/roles//_ but we 
> have no way to query this information anywhere. The docker containerizer does 
> not automatically mount this directory into our docker container, or symlinks 
> it into our sandbox. Therefore, I have essentially no access to it. Note that 
> the mesos containerizer (which I cannot use for other reasons) seems to 
> create a symlink in the sandbox to the actual path for the persistent volume. 
> With that, I could mount the volume into our docker container and all would 
> be well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4264) DockerContainerizerTest.ROOT_DOCKER_Usage fails when the VM running does not have memory cgroup mounted

2015-12-31 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-4264:
-
Summary: DockerContainerizerTest.ROOT_DOCKER_Usage fails when the VM 
running does not have memory cgroup mounted  (was: 
DockerContainerizerTest.ROOT_DOCKER_Usage fails when the VM running does not 
have )

> DockerContainerizerTest.ROOT_DOCKER_Usage fails when the VM running does not 
> have memory cgroup mounted
> ---
>
> Key: MESOS-4264
> URL: https://issues.apache.org/jira/browse/MESOS-4264
> Project: Mesos
>  Issue Type: Bug
>  Components: docker, test
> Environment: docker: 1.9.1
> EC2
> kernel: 
> {code:none}
> $uname  -a
> Linux zhitao-jessie 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u4 
> (2015-09-19) x86_64 GNU/Linux
> $ mount | grep cgroup
> tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
> cgroup on /sys/fs/cgroup/systemd type cgroup 
> (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
> cgroup on /sys/fs/cgroup/cpuset type cgroup 
> (rw,nosuid,nodev,noexec,relatime,cpuset)
> cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup 
> (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
> cgroup on /sys/fs/cgroup/devices type cgroup 
> (rw,nosuid,nodev,noexec,relatime,devices)
> cgroup on /sys/fs/cgroup/freezer type cgroup 
> (rw,nosuid,nodev,noexec,relatime,freezer)
> cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup 
> (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
> cgroup on /sys/fs/cgroup/blkio type cgroup 
> (rw,nosuid,nodev,noexec,relatime,blkio)
> cgroup on /sys/fs/cgroup/perf_event type cgroup 
> (rw,nosuid,nodev,noexec,relatime,perf_event)
> {code}
>Reporter: Zhitao Li
>Priority: Minor
>
> With debug enabled, I see the following failure when running the tests as root:
> {code:none}
> [ RUN  ] DockerContainerizerTest.ROOT_DOCKER_Usage
> ABORT: 
> (../../3rdparty/libprocess/3rdparty/stout/include/stout/result.hpp:109): 
> Result::get() but state == NONE
> *** Aborted at 1451549845 (unix time) try "date -d @1451549845" if you are 
> using GNU date ***
> PC: @ 0x7f9528ac7107 (unknown)
> *** SIGABRT (@0x18be0) received by PID 101344 (TID 0x7f951ef0e700) from PID 
> 101344; stack trace: ***
> @ 0x7f9529a788d0 (unknown)
> @ 0x7f9528ac7107 (unknown)
> @ 0x7f9528ac84e8 (unknown)
> @   0x96dd99 _Abort()
> @   0x96ddc7 _Abort()
> @   0x9c8714 Result<>::get()
> @ 0x7f952d871bef 
> mesos::internal::slave::DockerContainerizerProcess::cgroupsStatistics()
> @ 0x7f952d870bc2 
> _ZZN5mesos8internal5slave26DockerContainerizerProcess5usageERKNS_11ContainerIDEENKUliE_clEi
> @ 0x7f952d871121 
> _ZZN5mesos8internal5slave26DockerContainerizerProcess5usageERKNS_11ContainerIDEENKUlRKN6Docker9ContainerEE0_clES9_
> @ 0x7f952d877d8d 
> _ZZZNK7process9_DeferredIZN5mesos8internal5slave26DockerContainerizerProcess5usageERKNS1_11ContainerIDEEUlRKN6Docker9ContainerEE0_EcvSt8functionIFT_T0_EEINS_6FutureINS1_18ResourceStatisticsEEESB_EEvENKUlSB_E_clESB_ENKUlvE_clEv
> @ 0x7f952d87b8dd 
> _ZNSt17_Function_handlerIFN7process6FutureIN5mesos18ResourceStatisticsEEEvEZZNKS0_9_DeferredIZNS2_8internal5slave26DockerContainerizerProcess5usageERKNS2_11ContainerIDEEUlRKN6Docker9ContainerEE0_EcvSt8functionIFT_T0_EEIS4_SG_EEvENKUlSG_E_clESG_EUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x7f952d8ac919 std::function<>::operator()()
> @ 0x7f952d8a0b2a 
> _ZZN7process8dispatchIN5mesos18ResourceStatisticsEEENS_6FutureIT_EERKNS_4UPIDERKSt8functionIFS5_vEEENKUlPNS_11ProcessBaseEE_clESF_
> @ 0x7f952d8b5bc1 
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos18ResourceStatisticsEEENS0_6FutureIT_EERKNS0_4UPIDERKSt8functionIFS9_vEEEUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f952e16270f std::function<>::operator()()
> @ 0x7f952e1479fe process::ProcessBase::visit()
> @ 0x7f952e14d9ba process::DispatchEvent::visit()
> @   0x96ed2e process::ProcessBase::serve()
> @ 0x7f952e143cda process::ProcessManager::resume()
> @ 0x7f952e140ded 
> _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
> @ 0x7f952e14d17a 
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> @ 0x7f952e14d128 
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
> @ 0x7f952e14d0b8 
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEv

[jira] [Created] (MESOS-4264) DockerContainerizerTest.ROOT_DOCKER_Usage fails when the VM running does not have

2015-12-31 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-4264:


 Summary: DockerContainerizerTest.ROOT_DOCKER_Usage fails when the 
VM running does not have 
 Key: MESOS-4264
 URL: https://issues.apache.org/jira/browse/MESOS-4264
 Project: Mesos
  Issue Type: Bug
  Components: docker, test
 Environment: docker: 1.9.1
EC2
kernel: 
{code:none}
$uname  -a
Linux zhitao-jessie 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u4 
(2015-09-19) x86_64 GNU/Linux

$ mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup 
(rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup 
(rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup 
(rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/devices type cgroup 
(rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup 
(rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup 
(rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/blkio type cgroup 
(rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup 
(rw,nosuid,nodev,noexec,relatime,perf_event)
{code}




Reporter: Zhitao Li
Priority: Minor


With debug enabled, I see the following failure when running the tests as root:
{code:none}
[ RUN  ] DockerContainerizerTest.ROOT_DOCKER_Usage
ABORT: (../../3rdparty/libprocess/3rdparty/stout/include/stout/result.hpp:109): 
Result::get() but state == NONE
*** Aborted at 1451549845 (unix time) try "date -d @1451549845" if you are 
using GNU date ***
PC: @ 0x7f9528ac7107 (unknown)
*** SIGABRT (@0x18be0) received by PID 101344 (TID 0x7f951ef0e700) from PID 
101344; stack trace: ***
@ 0x7f9529a788d0 (unknown)
@ 0x7f9528ac7107 (unknown)
@ 0x7f9528ac84e8 (unknown)
@   0x96dd99 _Abort()
@   0x96ddc7 _Abort()
@   0x9c8714 Result<>::get()
@ 0x7f952d871bef 
mesos::internal::slave::DockerContainerizerProcess::cgroupsStatistics()
@ 0x7f952d870bc2 
_ZZN5mesos8internal5slave26DockerContainerizerProcess5usageERKNS_11ContainerIDEENKUliE_clEi
@ 0x7f952d871121 
_ZZN5mesos8internal5slave26DockerContainerizerProcess5usageERKNS_11ContainerIDEENKUlRKN6Docker9ContainerEE0_clES9_
@ 0x7f952d877d8d 
_ZZZNK7process9_DeferredIZN5mesos8internal5slave26DockerContainerizerProcess5usageERKNS1_11ContainerIDEEUlRKN6Docker9ContainerEE0_EcvSt8functionIFT_T0_EEINS_6FutureINS1_18ResourceStatisticsEEESB_EEvENKUlSB_E_clESB_ENKUlvE_clEv
@ 0x7f952d87b8dd 
_ZNSt17_Function_handlerIFN7process6FutureIN5mesos18ResourceStatisticsEEEvEZZNKS0_9_DeferredIZNS2_8internal5slave26DockerContainerizerProcess5usageERKNS2_11ContainerIDEEUlRKN6Docker9ContainerEE0_EcvSt8functionIFT_T0_EEIS4_SG_EEvENKUlSG_E_clESG_EUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7f952d8ac919 std::function<>::operator()()
@ 0x7f952d8a0b2a 
_ZZN7process8dispatchIN5mesos18ResourceStatisticsEEENS_6FutureIT_EERKNS_4UPIDERKSt8functionIFS5_vEEENKUlPNS_11ProcessBaseEE_clESF_
@ 0x7f952d8b5bc1 
_ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos18ResourceStatisticsEEENS0_6FutureIT_EERKNS0_4UPIDERKSt8functionIFS9_vEEEUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
@ 0x7f952e16270f std::function<>::operator()()
@ 0x7f952e1479fe process::ProcessBase::visit()
@ 0x7f952e14d9ba process::DispatchEvent::visit()
@   0x96ed2e process::ProcessBase::serve()
@ 0x7f952e143cda process::ProcessManager::resume()
@ 0x7f952e140ded 
_ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
@ 0x7f952e14d17a 
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
@ 0x7f952e14d128 
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
@ 0x7f952e14d0b8 
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
@ 0x7f952e14cffd 
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
@ 0x7f952e14cf7a 
_ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
@ 0x7f9529408970 (unknown)
@ 0x7f9529a710a4 start_thread
@ 0x7f9528b7804d (unknown)
{code}

I believe this is because we don't check that {{memCgroup}} is {{SOME}} before 
using it in {{Docke
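The missing guard would look roughly like this (a self-contained stand-in that 
uses {{std::optional}} in place of stout's {{Result}}; not the actual Mesos 
code):

{code:cpp}
#include <iostream>
#include <optional>
#include <string>

// Stand-in for a value that may be absent: on hosts without the memory cgroup
// hierarchy mounted there is simply no memory cgroup to report.
std::optional<std::string> memoryCgroupOf(bool memoryCgroupMounted)
{
  if (!memoryCgroupMounted) {
    return std::nullopt;  // Equivalent of Result state == NONE.
  }
  return "/sys/fs/cgroup/memory/docker/abc123";  // Made-up path.
}

int main()
{
  std::optional<std::string> memCgroup = memoryCgroupOf(false);

  // Check before dereferencing instead of calling get() unconditionally,
  // which is the class of bug behind "Result::get() but state == NONE".
  if (!memCgroup.has_value()) {
    std::cerr << "memory cgroup not mounted; skipping memory statistics"
              << std::endl;
    return 0;
  }

  std::cout << "memory cgroup: " << *memCgroup << std::endl;
  return 0;
}
{code}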

[jira] [Updated] (MESOS-4264) DockerContainerizerTest.ROOT_DOCKER_Usage fails when the VM used does not have memory cgroup mounted

2015-12-31 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-4264:
-
Summary: DockerContainerizerTest.ROOT_DOCKER_Usage fails when the VM used 
does not have memory cgroup mounted  (was: 
DockerContainerizerTest.ROOT_DOCKER_Usage fails when the VM running does not 
have memory cgroup mounted)

> DockerContainerizerTest.ROOT_DOCKER_Usage fails when the VM used does not 
> have memory cgroup mounted
> 
>
> Key: MESOS-4264
> URL: https://issues.apache.org/jira/browse/MESOS-4264
> Project: Mesos
>  Issue Type: Bug
>  Components: docker, test
> Environment: docker: 1.9.1
> EC2
> kernel: 
> {code:none}
> $uname  -a
> Linux zhitao-jessie 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u4 
> (2015-09-19) x86_64 GNU/Linux
> $ mount | grep cgroup
> tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
> cgroup on /sys/fs/cgroup/systemd type cgroup 
> (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
> cgroup on /sys/fs/cgroup/cpuset type cgroup 
> (rw,nosuid,nodev,noexec,relatime,cpuset)
> cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup 
> (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
> cgroup on /sys/fs/cgroup/devices type cgroup 
> (rw,nosuid,nodev,noexec,relatime,devices)
> cgroup on /sys/fs/cgroup/freezer type cgroup 
> (rw,nosuid,nodev,noexec,relatime,freezer)
> cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup 
> (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
> cgroup on /sys/fs/cgroup/blkio type cgroup 
> (rw,nosuid,nodev,noexec,relatime,blkio)
> cgroup on /sys/fs/cgroup/perf_event type cgroup 
> (rw,nosuid,nodev,noexec,relatime,perf_event)
> {code}
>Reporter: Zhitao Li
>Priority: Minor
>
> With debug enabled, I see the following failure when running the tests as root:
> {code:none}
> [ RUN  ] DockerContainerizerTest.ROOT_DOCKER_Usage
> ABORT: 
> (../../3rdparty/libprocess/3rdparty/stout/include/stout/result.hpp:109): 
> Result::get() but state == NONE
> *** Aborted at 1451549845 (unix time) try "date -d @1451549845" if you are 
> using GNU date ***
> PC: @ 0x7f9528ac7107 (unknown)
> *** SIGABRT (@0x18be0) received by PID 101344 (TID 0x7f951ef0e700) from PID 
> 101344; stack trace: ***
> @ 0x7f9529a788d0 (unknown)
> @ 0x7f9528ac7107 (unknown)
> @ 0x7f9528ac84e8 (unknown)
> @   0x96dd99 _Abort()
> @   0x96ddc7 _Abort()
> @   0x9c8714 Result<>::get()
> @ 0x7f952d871bef 
> mesos::internal::slave::DockerContainerizerProcess::cgroupsStatistics()
> @ 0x7f952d870bc2 
> _ZZN5mesos8internal5slave26DockerContainerizerProcess5usageERKNS_11ContainerIDEENKUliE_clEi
> @ 0x7f952d871121 
> _ZZN5mesos8internal5slave26DockerContainerizerProcess5usageERKNS_11ContainerIDEENKUlRKN6Docker9ContainerEE0_clES9_
> @ 0x7f952d877d8d 
> _ZZZNK7process9_DeferredIZN5mesos8internal5slave26DockerContainerizerProcess5usageERKNS1_11ContainerIDEEUlRKN6Docker9ContainerEE0_EcvSt8functionIFT_T0_EEINS_6FutureINS1_18ResourceStatisticsEEESB_EEvENKUlSB_E_clESB_ENKUlvE_clEv
> @ 0x7f952d87b8dd 
> _ZNSt17_Function_handlerIFN7process6FutureIN5mesos18ResourceStatisticsEEEvEZZNKS0_9_DeferredIZNS2_8internal5slave26DockerContainerizerProcess5usageERKNS2_11ContainerIDEEUlRKN6Docker9ContainerEE0_EcvSt8functionIFT_T0_EEIS4_SG_EEvENKUlSG_E_clESG_EUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x7f952d8ac919 std::function<>::operator()()
> @ 0x7f952d8a0b2a 
> _ZZN7process8dispatchIN5mesos18ResourceStatisticsEEENS_6FutureIT_EERKNS_4UPIDERKSt8functionIFS5_vEEENKUlPNS_11ProcessBaseEE_clESF_
> @ 0x7f952d8b5bc1 
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos18ResourceStatisticsEEENS0_6FutureIT_EERKNS0_4UPIDERKSt8functionIFS9_vEEEUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f952e16270f std::function<>::operator()()
> @ 0x7f952e1479fe process::ProcessBase::visit()
> @ 0x7f952e14d9ba process::DispatchEvent::visit()
> @   0x96ed2e process::ProcessBase::serve()
> @ 0x7f952e143cda process::ProcessManager::resume()
> @ 0x7f952e140ded 
> _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
> @ 0x7f952e14d17a 
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> @ 0x7f952e14d128 
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
> @ 0x7f952e14d0b8 
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wra

[jira] [Commented] (MESOS-3413) Docker containerizer does not symlink persistent volumes into sandbox

2016-01-04 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15081819#comment-15081819
 ] 

Zhitao Li commented on MESOS-3413:
--

[~jieyu] and [~haosd...@gmail.com], I've put up 
https://reviews.apache.org/r/41892 as a first pass at unblocking 
DockerContainerizer users who want to use persistent volumes. Please let me 
know what you think.

> Docker containerizer does not symlink persistent volumes into sandbox
> -
>
> Key: MESOS-3413
> URL: https://issues.apache.org/jira/browse/MESOS-3413
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker, slave
>Affects Versions: 0.23.0
>Reporter: Max Neunhöffer
>Assignee: haosdent
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> For the ArangoDB framework I am trying to use the persistent primitives. 
> nearly all is working, but I am missing a crucial piece at the end: I have 
> successfully created a persistent disk resource and have set the persistence 
> and volume information in the DiskInfo message. However, I do not see any way 
> to find out what directory on the host the mesos slave has reserved for us. I 
> know it is ${MESOS_SLAVE_WORKDIR}/volumes/roles//_ but we 
> have no way to query this information anywhere. The docker containerizer does 
> not automatically mount this directory into our docker container, or symlinks 
> it into our sandbox. Therefore, I have essentially no access to it. Note that 
> the mesos containerizer (which I cannot use for other reasons) seems to 
> create a symlink in the sandbox to the actual path for the persistent volume. 
> With that, I could mount the volume into our docker container and all would 
> be well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3509) SlaveTest.TerminatingSlaveDoesNotReregister is flaky

2016-01-14 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15099128#comment-15099128
 ] 

Zhitao Li commented on MESOS-3509:
--

I'll post some inconclusive findings, since my last change seems to have turned 
this test from "flaky" into "failing".

In short, the clock/timer implementation in {{libevent}} seems to allow a timer 
to be created *out of order* w.r.t. {{Clock::advance()}} in certain cases. As a 
result, a timer created after the clock-advancing code was not affected by the 
advance and instead waited for a real wall-clock time of 120s 
({{slave::REGISTER_RETRY_INTERVAL_MAX * 2}}), which is longer than the default 
15s of {{AWAIT_READY}}.
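For reference, the pattern the test relies on looks roughly like the following 
(a sketch against libprocess's test clock API; the 2-minute advance mirrors 
{{slave::REGISTER_RETRY_INTERVAL_MAX * 2}}). The problem described above is that 
a timer created only after the advance is not covered by it:

{code:cpp}
// Sketch only: settle the clock so in-flight dispatches get a chance to
// create their timers, then advance past the retry interval. A timer that is
// created *after* the advance still waits for real wall-clock time, which is
// what makes the test exceed AWAIT_READY's default 15 seconds.
#include <process/clock.hpp>
#include <stout/duration.hpp>

void advancePastRegistrationRetry()
{
  process::Clock::pause();
  process::Clock::settle();             // Let pending timers be created first.
  process::Clock::advance(Minutes(2));  // Covers timers that already exist.
  process::Clock::settle();
  process::Clock::resume();
}
{code}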

I took a snippet of log to run the test with {{--libevent}} and {{GLOG_v=3}}:
{panel}
I0114 23:25:16.880738 129300 authenticator.cpp:317] Authentication success
I0114 23:25:16.880803 129295 process.cpp:2502] Resuming master@127.0.0.1:31370 
at 2016-01-14 23:25:16.878512896+00:00
I0114 23:25:16.880844 129301 process.cpp:2502] Resuming 
crammd5_authenticator(1)@127.0.0.1:31370 at 2016-01-14 23:25:16.878512896+00:00
I0114 23:25:16.880861 129295 master.cpp:5475] Successfully authenticated 
principal 'test-principal' at slave(1)@127.0.0.1:31370
I0114 23:25:16.880903 129301 authenticator.cpp:431] Authentication session 
cleanup for crammd5_authenticatee(3)@127.0.0.1:31370
I0114 23:25:16.880873 129298 process.cpp:2502] Resuming 
crammd5_authenticatee(3)@127.0.0.1:31370 at 2016-01-14 23:25:16.878512896+00:00
I0114 23:25:16.880975 129301 process.cpp:2800] Donating thread to 
crammd5_authenticator_session(3)@127.0.0.1:31370 while waiting
I0114 23:25:16.880991 129298 authenticatee.cpp:298] Authentication success
I0114 23:25:16.881000 129301 process.cpp:2502] Resuming 
crammd5_authenticator_session(3)@127.0.0.1:31370 at 2016-01-14 
23:25:16.878512896+00:00
I0114 23:25:16.881027 129301 process.cpp:2607] Cleaning up 
crammd5_authenticator_session(3)@127.0.0.1:31370
I0114 23:25:16.881069 129298 process.cpp:2502] Resuming 
slave(1)@127.0.0.1:31370 at 2016-01-14 23:25:16.878512896+00:00
I0114 23:25:16.881184 129300 process.cpp:2502] Resuming 
crammd5_authenticatee(3)@127.0.0.1:31370 at 2016-01-14 23:25:16.878512896+00:00
I0114 23:25:16.881225 129300 process.cpp:2607] Cleaning up 
crammd5_authenticatee(3)@127.0.0.1:31370
I0114 23:25:16.881301 129298 slave.cpp:860] Successfully authenticated with 
master master@127.0.0.1:31370
I0114 23:25:16.881631 129296 process.cpp:2502] Resuming master@127.0.0.1:31370 
at 2016-01-14 23:25:16.878512896+00:00
I0114 23:25:16.881629 129298 slave.cpp:1254] Will retry registration in 
2.172151ms if necessary
I0114 23:25:16.881724 129298 clock.cpp:279] Created a timer for 
slave(1)@127.0.0.1:31370 in 2.172151ms in the future (2016-01-14 
23:25:16.880685047+00:00)
I0114 23:25:16.881906 129296 master.cpp:4314] Re-registering slave 
a9a5fba6-3191-424d-a1cf-5d12f35ada17-S0 at slave(1)@127.0.0.1:31370 (localhost)
I0114 23:25:16.882228 129294 process.cpp:2502] Resuming 
slave(1)@127.0.0.1:31370 at 2016-01-14 23:25:16.878512896+00:00
I0114 23:25:16.882613 129295 process.cpp:2502] Resuming 
slave(1)@127.0.0.1:31370 at 2016-01-14 23:25:16.878512896+00:00
I0114 23:25:16.882622 129296 master.cpp:4502] Sending updated checkpointed 
resources  to slave a9a5fba6-3191-424d-a1cf-5d12f35ada17-S0 at 
slave(1)@127.0.0.1:31370 (localhost)
I0114 23:25:16.882712 129295 pid.cpp:93] Attempting to parse 
'scheduler-72698b83-ea69-4f94-ac79-1fe005ba5ea9@127.0.0.1:31370' into a PID
W0114 23:25:16.882750 129295 slave.cpp:2162] Dropping updateFramework message 
for a9a5fba6-3191-424d-a1cf-5d12f35ada17- because the slave is in 
DISCONNECTED state
I0114 23:25:16.882935 129295 slave.cpp:2277] Updated checkpointed resources 
from  to 
I0114 23:25:16.883074 129302 clock.cpp:152] Handling timers up to 2016-01-14 
23:25:16.878512896+00:00
I0114 23:25:16.883116 129302 clock.cpp:197] Clock has settled
I0114 23:25:16.888927 129289 clock.cpp:465] Clock is settled
I0114 23:25:16.889067 129289 clock.cpp:381] Clock advanced (2mins) to 0x20681f0
I0114 23:25:16.889143 129302 clock.cpp:152] Handling timers up to 2016-01-14 
23:27:16.878512896+00:00
I0114 23:25:16.889176 129302 clock.cpp:159] Have timeout(s) at 2016-01-14 
23:25:16.880685047+00:00
I0114 23:25:16.889196 129302 clock.cpp:159] Have timeout(s) at 2016-01-14 
23:25:16.882896819+00:00
I0114 23:25:16.889209 129302 clock.cpp:159] Have timeout(s) at 2016-01-14 
23:25:16.965011968+00:00
I0114 23:25:16.889220 129302 clock.cpp:159] Have timeout(s) at 2016-01-14 
23:25:17.800440064+00:00
I0114 23:25:16.889231 129302 clock.cpp:159] Have timeout(s) at 2016-01-14 
23:25:18.572319400+00:00
I0114 23:25:16.889242 129302 clock.cpp:159] Have timeout(s) at 2016-01-14 
23:25:21.561927936+00:00
I0114 23:25:16.889253 129302 clock.cpp:159] Have timeout(s) at 2016-01-14 
23:25:21.847002880+00:00
I0114 23:25:16.889263 129302 clock.cpp:159] Have timeout(s) at 2016-01-

[jira] [Created] (MESOS-7852) Tighten error handling in slaveRunTaskLabelDecorator hook

2017-08-02 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-7852:


 Summary: Tighten error handling in slaveRunTaskLabelDecorator hook
 Key: MESOS-7852
 URL: https://issues.apache.org/jira/browse/MESOS-7852
 Project: Mesos
  Issue Type: Bug
  Components: modules
Reporter: Zhitao Li


For whatever reason, {{slaveRunTaskLabelDecorator}} allows the module author to 
return an error, but the hook manager "silently" suppresses the error in 
{{HookManager::slaveRunTaskLabelDecorator}} and proceeds.

This creates some problems:

1) a module author could incorrectly assume that a returned error causes the 
task run to fail, but that's actually not the case;

2) a module author has no way to instruct the Mesos agent to stop the task 
launch if an unrecoverable error happens.

I suggest we tighten the handling here to fail the task run if the module 
reports an error. A module can still work around soft errors by simply 
returning the input labels as-is, as sketched below.
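A rough sketch of the proposed behavior (names and signatures here are 
illustrative, not the current {{HookManager}} code):

{code:cpp}
#include <mesos/mesos.hpp>

#include <stout/error.hpp>
#include <stout/result.hpp>
#include <stout/try.hpp>

using mesos::Labels;
using mesos::TaskInfo;

// `decorated` is what the module's slaveRunTaskLabelDecorator returned.
Try<Labels> applyLabelDecorator(
    const TaskInfo& taskInfo,
    const Result<Labels>& decorated)
{
  if (decorated.isError()) {
    // Proposed: an error from the module fails the task launch instead of
    // being logged and ignored.
    return Error("Label decorator failed: " + decorated.error());
  }

  if (decorated.isNone()) {
    // The module chose not to modify the labels; keep the task's own labels.
    // This is also how a module can absorb soft errors.
    return taskInfo.labels();
  }

  return decorated.get();
}
{code}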



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7868) Support virtual filesystem in `Files` interface

2017-08-08 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-7868:


 Summary: Support virtual filesystem in `Files` interface
 Key: MESOS-7868
 URL: https://issues.apache.org/jira/browse/MESOS-7868
 Project: Mesos
  Issue Type: Improvement
Reporter: Zhitao Li


Based on a conversation with [~bmahler], the {{Files}} interface, which is used 
in [/files/download | 
http://mesos.apache.org/documentation/latest/endpoints/files/download/] and 
other files endpoints, is intended to support virtual path lookup, so a caller 
can simply provide something like {{//latest}} to navigate and/or download 
files in a sandbox.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7868) Support virtual filesystem in `Files` interface

2017-08-08 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-7868:
-
Component/s: agent

> Support virtual filesystem in `Files` interface
> ---
>
> Key: MESOS-7868
> URL: https://issues.apache.org/jira/browse/MESOS-7868
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Zhitao Li
>
> Based on a conversation with [~bmahler], the {{Files}} interface, which is 
> used in [/files/download | 
> http://mesos.apache.org/documentation/latest/endpoints/files/download/] and 
> other files endpoints, is intended to support virtual path lookup, so a 
> caller can simply provide something like {{//latest}} to navigate and/or 
> download files in a sandbox.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7874) Provide a consistent non-blocking preLaunch hook

2017-08-09 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-7874:


 Summary: Provide a consistent non-blocking preLaunch hook
 Key: MESOS-7874
 URL: https://issues.apache.org/jira/browse/MESOS-7874
 Project: Mesos
  Issue Type: Improvement
  Components: modules
Reporter: Zhitao Li
Assignee: Zhitao Li


Our use case: we need a non-blocking prelaunch hook to integrate with our own 
secret management system, and this hook needs to work under both 
{{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom 
executor}} and {{command executor}}, with proper access to {{TaskInfo}} 
(actually, certain labels on it).

As of 1.3.0, the hooks in [hook.hpp | 
https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty 
inconsistent across these combinations.

The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}, however 
it has a couple of problems:

1. For DockerContainerizer + custom executor, it strips away TaskInfo and sends 
a `None()` instead;
2. This hook is not called on {{MesosContainerizer}} at all. I guess that's 
because people can implement an {{isolator}} instead? However, it creates extra 
work for module authors and operators.

The other option is {{masterLaunchTaskLabelDecorator}}, but it has its own 
problems:
1. Errors are silently swallowed, so the module cannot stop the task launch 
sequence;
2. It's blocking, which means we cannot wait for another subprocess or an RPC 
result.

I'm inclined to fix the two problems on 
{{slavePreLaunchDockerTaskExecutorDecorator}}, but I'm open to other 
suggestions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7874) Provide a consistent non-blocking preLaunch hook

2017-08-09 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-7874:
-
Description: 
Our use case: we need a non-blocking prelaunch hook to integrate with our own 
secret management system, and this hook needs to work under both 
{{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom 
executor}} and {{command executor}}, with proper access to {{TaskInfo}} 
(actually certain labels on it).

As of 1.3.0, the hooks in [hook.hpp | 
https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] pretty 
inconsistent on these combination cases.

The closest option on is {{slavePreLaunchDockerTaskExecutorDecorator}}, however 
it has a couple of problems:

1. For DockerContainerizer + custom executor, it strips away TaskInfo and sends 
a `None()` instead;
2. This hook is not called on {{MesosContainerizer}} at all. I guess it's 
because people can implement an {{isolator}}? However, it creates extra work 
for module authors and operators.

The other option is {{slaveLaunchTaskLabelDecorator}}, but it has own problems:
1. Error are silently swallowed so module cannot stop the task running sequence;
2. It's a blocking version, which means we cannot wait for another subprocess's 
or RPC result.

I'm inclined to fix the two problems on 
{{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.

  was:
Our use case: we need a non-blocking prelaunch hook to integrate with our own 
secret management system, and this hook needs to work under both 
{{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom 
executor}} and {{command executor}}, with proper access to {{TaskInfo}} 
(actually certain labels on it).

As of 1.3.0, the hooks in [hook.hpp | 
https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] pretty 
inconsistent on these combination cases.

The closest option on is {{slavePreLaunchDockerTaskExecutorDecorator}}, however 
it has a couple of problems:

1. For DockerContainerizer + custom executor, it strips away TaskInfo and sends 
a `None()` instead;
2. This hook is not called on {{MesosContainerizer}} at all. I guess it's 
because people can implement an {{isolator}}? However, it creates extra work 
for module authors and operators.

The other option is {{masterLaunchTaskLabelDecorator}}, but it has own problems:
1. Error are silently swallowed so module cannot stop the task running sequence;
2. It's a blocking version, which means we cannot wait for another subprocess's 
or RPC result.

I'm inclined to fix the two problems on 
{{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.


> Provide a consistent non-blocking preLaunch hook
> 
>
> Key: MESOS-7874
> URL: https://issues.apache.org/jira/browse/MESOS-7874
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>  Labels: hooks, module
>
> Our use case: we need a non-blocking prelaunch hook to integrate with our own 
> secret management system, and this hook needs to work under both 
> {{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom 
> executor}} and {{command executor}}, with proper access to {{TaskInfo}} 
> (actually certain labels on it).
> As of 1.3.0, the hooks in [hook.hpp | 
> https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] pretty 
> inconsistent on these combination cases.
> The closest option on is {{slavePreLaunchDockerTaskExecutorDecorator}}, 
> however it has a couple of problems:
> 1. For DockerContainerizer + custom executor, it strips away TaskInfo and 
> sends a `None()` instead;
> 2. This hook is not called on {{MesosContainerizer}} at all. I guess it's 
> because people can implement an {{isolator}}? However, it creates extra work 
> for module authors and operators.
> The other option is {{slaveLaunchTaskLabelDecorator}}, but it has own 
> problems:
> 1. Error are silently swallowed so module cannot stop the task running 
> sequence;
> 2. It's a blocking version, which means we cannot wait for another 
> subprocess's or RPC result.
> I'm inclined to fix the two problems on 
> {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7874) Provide a consistent non-blocking preLaunch hook

2017-08-09 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-7874:
-
Description: 
Our use case: we need a non-blocking prelaunch hook to integrate with our own 
secret management system, and this hook needs to work under both 
{{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom 
executor}} and {{command executor}}, with proper access to {{TaskInfo}} 
(actually certain labels on it).

As of 1.3.0, the hooks in [hook.hpp | 
https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] pretty 
inconsistent on these combination cases.

The closest option on is {{slavePreLaunchDockerTaskExecutorDecorator}}, however 
it has a couple of problems:

1. For DockerContainerizer + custom executor, it strips away TaskInfo and sends 
a `None()` instead;
2. This hook is not called on {{MesosContainerizer}} at all. I guess it's 
because people can implement an {{isolator}}? However, it creates extra work 
for module authors and operators.

The other option is {{slaveRunTaskLabelDecorator}}, but it has own problems:
1. Error are silently swallowed so module cannot stop the task running sequence;
2. It's a blocking version, which means we cannot wait for another subprocess's 
or RPC result.

I'm inclined to fix the two problems on 
{{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.

  was:
Our use case: we need a non-blocking prelaunch hook to integrate with our own 
secret management system, and this hook needs to work under both 
{{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom 
executor}} and {{command executor}}, with proper access to {{TaskInfo}} 
(actually certain labels on it).

As of 1.3.0, the hooks in [hook.hpp | 
https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] pretty 
inconsistent on these combination cases.

The closest option on is {{slavePreLaunchDockerTaskExecutorDecorator}}, however 
it has a couple of problems:

1. For DockerContainerizer + custom executor, it strips away TaskInfo and sends 
a `None()` instead;
2. This hook is not called on {{MesosContainerizer}} at all. I guess it's 
because people can implement an {{isolator}}? However, it creates extra work 
for module authors and operators.

The other option is {{slaveLaunchTaskLabelDecorator}}, but it has own problems:
1. Error are silently swallowed so module cannot stop the task running sequence;
2. It's a blocking version, which means we cannot wait for another subprocess's 
or RPC result.

I'm inclined to fix the two problems on 
{{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.


> Provide a consistent non-blocking preLaunch hook
> 
>
> Key: MESOS-7874
> URL: https://issues.apache.org/jira/browse/MESOS-7874
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>  Labels: hooks, module
>
> Our use case: we need a non-blocking prelaunch hook to integrate with our own 
> secret management system, and this hook needs to work under both 
> {{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom 
> executor}} and {{command executor}}, with proper access to {{TaskInfo}} 
> (actually certain labels on it).
> As of 1.3.0, the hooks in [hook.hpp | 
> https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] pretty 
> inconsistent on these combination cases.
> The closest option on is {{slavePreLaunchDockerTaskExecutorDecorator}}, 
> however it has a couple of problems:
> 1. For DockerContainerizer + custom executor, it strips away TaskInfo and 
> sends a `None()` instead;
> 2. This hook is not called on {{MesosContainerizer}} at all. I guess it's 
> because people can implement an {{isolator}}? However, it creates extra work 
> for module authors and operators.
> The other option is {{slaveRunTaskLabelDecorator}}, but it has own problems:
> 1. Error are silently swallowed so module cannot stop the task running 
> sequence;
> 2. It's a blocking version, which means we cannot wait for another 
> subprocess's or RPC result.
> I'm inclined to fix the two problems on 
> {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7878) Add default value for http_framework_authenticators flag

2017-08-10 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-7878:


 Summary: Add default value for http_framework_authenticators flag
 Key: MESOS-7878
 URL: https://issues.apache.org/jira/browse/MESOS-7878
 Project: Mesos
  Issue Type: Improvement
Reporter: Zhitao Li
Priority: Minor


Based on http://mesos.apache.org/documentation/latest/configuration/, 
{{http_authenticator}} has a default value of {{basic}}, but 
{{http_framework_authenticators}} does not have one.

Given that people running the default Mesos distribution only have {{basic}} 
available, I feel we should add a default value to this flag to avoid 
surprising operators when they turn on HTTP frameworks.

Proposing Greg as shepherd.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7874) Provide a consistent non-blocking preLaunch hook

2017-08-10 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-7874:
-
Description: 
Our use case: we need a non-blocking way to notify our secret management system 
during the task launch sequence on the agent. This mechanism needs to work for 
both {{DockerContainerizer}} and {{MesosContainerizer}}, and for both {{custom 
executor}} and {{command executor}}, with proper access to labels on 
{{TaskInfo}}.

As of 1.3.0, the hooks in [hook.hpp | 
https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty 
inconsistent across these combinations.

The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}, however 
it has a couple of problems:

1. For DockerContainerizer + custom executor, it strips away TaskInfo and sends 
a `None()` instead;
2. This hook is not called on {{MesosContainerizer}} at all. I guess that's 
because people can implement an {{isolator}} instead? However, it creates extra 
work for module authors and operators.

The other option is {{slaveRunTaskLabelDecorator}}, but it has its own problems:
1. Errors are silently swallowed, so the module cannot stop the task launch 
sequence;
2. It's blocking, which means we cannot wait for another subprocess or an RPC 
result.

I'm inclined to fix the two problems on 
{{slavePreLaunchDockerTaskExecutorDecorator}}, but I'm open to other 
suggestions.

  was:
Our use case: we need a non-blocking prelaunch hook to integrate with our own 
secret management system, and this hook needs to work under both 
{{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom 
executor}} and {{command executor}}, with proper access to {{TaskInfo}} 
(actually certain labels on it).

As of 1.3.0, the hooks in [hook.hpp | 
https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] pretty 
inconsistent on these combination cases.

The closest option on is {{slavePreLaunchDockerTaskExecutorDecorator}}, however 
it has a couple of problems:

1. For DockerContainerizer + custom executor, it strips away TaskInfo and sends 
a `None()` instead;
2. This hook is not called on {{MesosContainerizer}} at all. I guess it's 
because people can implement an {{isolator}}? However, it creates extra work 
for module authors and operators.

The other option is {{slaveRunTaskLabelDecorator}}, but it has own problems:
1. Error are silently swallowed so module cannot stop the task running sequence;
2. It's a blocking version, which means we cannot wait for another subprocess's 
or RPC result.

I'm inclined to fix the two problems on 
{{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.


> Provide a consistent non-blocking preLaunch hook
> 
>
> Key: MESOS-7874
> URL: https://issues.apache.org/jira/browse/MESOS-7874
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>  Labels: hooks, module
>
> Our use case: we need a non-blocking way to notify our secret management 
> system during task launching sequence on agent. This mechanism needs to work 
> for both {{DockerContainerizer}} and {{MesosContainerizer}}, and both 
> {{custom executor}} and {{command executor}}, with proper access to labels on 
> {{TaskInfo}}.
> As of 1.3.0, the hooks in [hook.hpp | 
> https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] pretty 
> inconsistent on these combination cases.
> The closest option on is {{slavePreLaunchDockerTaskExecutorDecorator}}, 
> however it has a couple of problems:
> 1. For DockerContainerizer + custom executor, it strips away TaskInfo and 
> sends a `None()` instead;
> 2. This hook is not called on {{MesosContainerizer}} at all. I guess it's 
> because people can implement an {{isolator}}? However, it creates extra work 
> for module authors and operators.
> The other option is {{slaveRunTaskLabelDecorator}}, but it has own problems:
> 1. Error are silently swallowed so module cannot stop the task running 
> sequence;
> 2. It's a blocking version, which means we cannot wait for another 
> subprocess's or RPC result.
> I'm inclined to fix the two problems on 
> {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7893) Make sure slavePreLaunchDockerTaskExecutorDecorator is consistently called in DockerContainerizer

2017-08-15 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-7893:


 Summary: Make sure slavePreLaunchDockerTaskExecutorDecorator is 
consistently called in DockerContainerizer
 Key: MESOS-7893
 URL: https://issues.apache.org/jira/browse/MESOS-7893
 Project: Mesos
  Issue Type: Improvement
Reporter: Zhitao Li


When {{DockerContainerizer}} and a non-command executor are used together, the 
{{slavePreLaunchDockerTaskExecutorDecorator}} hook is called with {{TaskInfo = 
None()}}.

We should keep the task info passed to the hook to provide a consistent 
interface.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7874) Provide a consistent non-blocking preLaunch hook

2017-08-15 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127628#comment-16127628
 ] 

Zhitao Li commented on MESOS-7874:
--

After some discussion, we will do the following:
1. convert {{slaveRunTaskLabelDecorator}} and {{masterRunTaskLabelDecorator}} 
to be non-blocking (see the sketch below);
2. ensure {{slavePreLaunchDockerTaskExecutorDecorator}} is called consistently 
in {{DockerContainerizer}}.

I will repurpose this task for the first action item, and file MESOS-7893 for 
the second.
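For the first item, the non-blocking shape I have in mind looks roughly like 
this (the Future-returning signature and the {{fetchSecretLabels}} helper are 
assumptions for illustration, not the committed API):

{code:cpp}
#include <mesos/mesos.hpp>

#include <process/future.hpp>

using mesos::Labels;
using mesos::TaskInfo;

using process::Future;

// Hypothetical async helper provided by the module author, e.g. an RPC to an
// in-house secret service.
Future<Labels> fetchSecretLabels(const TaskInfo& taskInfo);

// Returning a Future lets the module wait on a subprocess or RPC without
// blocking the agent's task launch path; a failed Future could abort the
// launch instead of being silently swallowed.
Future<Labels> decorateLabels(const TaskInfo& taskInfo)
{
  return fetchSecretLabels(taskInfo)
    .then([taskInfo](const Labels& fetched) {
      Labels result = taskInfo.labels();  // Keep the labels already present.
      result.MergeFrom(fetched);          // Append the asynchronously fetched ones.
      return result;
    });
}
{code}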

> Provide a consistent non-blocking preLaunch hook
> 
>
> Key: MESOS-7874
> URL: https://issues.apache.org/jira/browse/MESOS-7874
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>  Labels: hooks, module
>
> Our use case: we need a non-blocking way to notify our secret management 
> system during the task launching sequence on the agent. This mechanism needs 
> to work for both {{DockerContainerizer}} and {{MesosContainerizer}}, and both 
> {{custom executor}} and {{command executor}}, with proper access to labels on 
> {{TaskInfo}}.
> As of 1.3.0, the hooks in [hook.hpp | 
> https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty 
> inconsistent across these combinations.
> The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}; however, 
> it has a couple of problems:
> 1. For DockerContainerizer + custom executor, it strips away TaskInfo and 
> sends a `None()` instead;
> 2. This hook is not called by {{MesosContainerizer}} at all. I guess that's 
> because people can implement an {{isolator}} instead? However, it creates 
> extra work for module authors and operators.
> The other option is {{slaveRunTaskLabelDecorator}}, but it has its own 
> problems:
> 1. Errors are silently swallowed, so a module cannot stop the task launching 
> sequence;
> 2. It's a blocking API, which means we cannot wait for the result of another 
> subprocess or RPC.
> I'm inclined to fix the two problems on 
> {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7874) Convert slaveRunTaskLabelDecorator and masterRunTaskLabelDecorator to non-blocking API

2017-08-15 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-7874:
-
Summary: Convert slaveRunTaskLabelDecorator and masterRunTaskLabelDecorator 
to non-blocking API  (was: Provide a consistent non-blocking preLaunch hook)

> Convert slaveRunTaskLabelDecorator and masterRunTaskLabelDecorator to 
> non-blocking API
> --
>
> Key: MESOS-7874
> URL: https://issues.apache.org/jira/browse/MESOS-7874
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>  Labels: hooks, module
>
> Our use case: we need a non-blocking way to notify our secret management 
> system during the task launching sequence on the agent. This mechanism needs 
> to work for both {{DockerContainerizer}} and {{MesosContainerizer}}, and both 
> {{custom executor}} and {{command executor}}, with proper access to labels on 
> {{TaskInfo}}.
> As of 1.3.0, the hooks in [hook.hpp | 
> https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty 
> inconsistent across these combinations.
> The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}; however, 
> it has a couple of problems:
> 1. For DockerContainerizer + custom executor, it strips away TaskInfo and 
> sends a `None()` instead;
> 2. This hook is not called by {{MesosContainerizer}} at all. I guess that's 
> because people can implement an {{isolator}} instead? However, it creates 
> extra work for module authors and operators.
> The other option is {{slaveRunTaskLabelDecorator}}, but it has its own 
> problems:
> 1. Errors are silently swallowed, so a module cannot stop the task launching 
> sequence;
> 2. It's a blocking API, which means we cannot wait for the result of another 
> subprocess or RPC.
> I'm inclined to fix the two problems on 
> {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-7893) Make sure slavePreLaunchDockerTaskExecutorDecorator is consistently called in DockerContainerizer

2017-08-15 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li reassigned MESOS-7893:


   Shepherd: Till Toenshoff
   Assignee: Zhitao Li
 Labels: docker hooks module  (was: )
Component/s: modules
 docker

> Make sure slavePreLaunchDockerTaskExecutorDecorator is consistently called in 
> DockerContainerizer
> -
>
> Key: MESOS-7893
> URL: https://issues.apache.org/jira/browse/MESOS-7893
> Project: Mesos
>  Issue Type: Improvement
>  Components: docker, modules
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>  Labels: docker, hooks, module
>
> When {{DockerContainerizer}} and a non-command executor are used together, the 
> hook {{slavePreLaunchDockerTaskExecutorDecorator}} is called with 
> {{TaskInfo = None()}}.
> We should keep the {{TaskInfo}} passed to the hook to provide a consistent 
> interface.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7874) Convert slaveRunTaskLabelDecorator and masterRunTaskLabelDecorator to non-blocking API

2017-08-15 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127771#comment-16127771
 ] 

Zhitao Li commented on MESOS-7874:
--

About implementation:

The changes to hook.hpp and hook/manager.hpp(cpp) should be relatively 
straightforward.

For changes to the {{Master}} class, I took a quick look and there seem to be 
two different paths:

* Performing the non-blocking hook before `Master::_accept`
*  Pro:
* Can be done in parallel with authorization (the other non-blocking 
thing in the operation);
* Simpler handling for sending messages to the slave: because everything 
will be ready in {{Master::_accept}}, we can still send the corresponding 
messages to the slave (`CheckpointMessage` for RESERVE/UNRESERVE/..., `RunTask` 
or `RunTaskGroup` for running a task/task group, etc.).
* Con:
* Task validation and authorization are not performed yet, so hooks 
could see tasks which never get launched
* technically that is always possible if the agent disconnects/goes down, 
or the `send(slave->pid, message);` gets dropped. Frameworks are reliably told 
the task status, but hooks are not.
* More thoughts:
* Maybe we should consider creating a private helper struct on the Master 
class to mutate `OfferOperation` (adding task labels is only one example), to 
facilitate further changes?
* Perform the hook inside `Master::_accept`
* Pro:
* We already know there is a pending task launch, so less code on this 
part;
* Con:
* To preserve the ordering of messages, we would need to change `void 
Master::_apply(...)` so that it returns a `Future` and caches it, and 
only send out all messages once everything is ready.

I'm inclined to go with the first path, but some discussion with people more 
familiar with the large master code base is definitely welcome.

Thanks!
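For illustration, here is a rough, self-contained sketch of the non-blocking shape in standard C++. {{std::future}}/{{std::async}} stand in for libprocess futures, and the types and function name are simplified stand-ins rather than the real hook API:

```cpp
#include <future>
#include <map>
#include <string>

struct TaskInfo { std::map<std::string, std::string> labels; };
using Labels = std::map<std::string, std::string>;

// A non-blocking decorator returns a future instead of a value, so the caller
// (master/agent) can keep working (e.g. run authorization) and only consume
// the labels once the external subprocess/RPC has finished. A failed future
// could then abort the launch sequence instead of being silently swallowed.
std::future<Labels> runTaskLabelDecoratorAsync(const TaskInfo& task)
{
  return std::async(std::launch::async, [task]() {
    // Placeholder for "notify the secret management system" (an RPC or
    // subprocess call in a real module).
    Labels labels = task.labels;
    labels["secret/acknowledged"] = "true";
    return labels;
  });
}

int main()
{
  TaskInfo task;
  task.labels["app"] = "example";

  std::future<Labels> pending = runTaskLabelDecoratorAsync(task);
  // ... the caller continues with other work here ...
  Labels decorated = pending.get();  // consumed only when actually needed
  return decorated.count("secret/acknowledged") == 1 ? 0 : 1;
}
```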

> Convert slaveRunTaskLabelDecorator and masterRunTaskLabelDecorator to 
> non-blocking API
> --
>
> Key: MESOS-7874
> URL: https://issues.apache.org/jira/browse/MESOS-7874
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>  Labels: hooks, module
>
> Our use case: we need a non-blocking way to notify our secret management 
> system during the task launching sequence on the agent. This mechanism needs 
> to work for both {{DockerContainerizer}} and {{MesosContainerizer}}, and both 
> {{custom executor}} and {{command executor}}, with proper access to labels on 
> {{TaskInfo}}.
> As of 1.3.0, the hooks in [hook.hpp | 
> https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty 
> inconsistent across these combinations.
> The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}; however, 
> it has a couple of problems:
> 1. For DockerContainerizer + custom executor, it strips away TaskInfo and 
> sends a `None()` instead;
> 2. This hook is not called by {{MesosContainerizer}} at all. I guess that's 
> because people can implement an {{isolator}} instead? However, it creates 
> extra work for module authors and operators.
> The other option is {{slaveRunTaskLabelDecorator}}, but it has its own 
> problems:
> 1. Errors are silently swallowed, so a module cannot stop the task launching 
> sequence;
> 2. It's a blocking API, which means we cannot wait for the result of another 
> subprocess or RPC.
> I'm inclined to fix the two problems on 
> {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-7874) Convert slaveRunTaskLabelDecorator and masterRunTaskLabelDecorator to non-blocking API

2017-08-15 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127771#comment-16127771
 ] 

Zhitao Li edited comment on MESOS-7874 at 8/15/17 8:27 PM:
---

About implementation:

The changes to hook.hpp and hook/manager.hpp(cpp) should be relatively 
straightforward.

For changes to the {{Master}} class, I took a quick look and there seem to be 
two different paths:

* Performing the non-blocking hook before `Master::_accept`
**  Pro:
*** Can be done in parallel with authorization (the other non-blocking 
thing in the operation);
*** Simpler handling for sending messages to the slave: because everything 
will be ready in {{Master::_accept}}, we can still send the corresponding 
messages to the slave (`CheckpointMessage` for RESERVE/UNRESERVE/..., `RunTask` 
or `RunTaskGroup` for running a task/task group, etc.).
** Con:
*** Task validation and authorization are not performed yet, so hooks 
could see tasks which never get launched
**** technically that is always possible if the agent disconnects/goes 
down, or the `send(slave->pid, message);` gets dropped. Frameworks are reliably 
told the task status, but hooks are not.
** More thoughts:
*** Maybe we should consider creating a private helper struct on the Master 
class to mutate `OfferOperation` (adding task labels is only one example), to 
facilitate further changes?
* Perform the hook inside `Master::_accept`
** Pro:
*** We already know there is a pending task launch, so less code on this 
part;
** Con:
*** To preserve the ordering of messages, we would need to change 
`void Master::_apply(...)` so that it returns a `Future` and caches 
it, and only send out all messages once everything is ready.

I'm inclined to go with the first path, but some discussion with people more 
familiar with the large master code base is definitely welcome.

Thanks!


was (Author: zhitao):
About implementation:

The changes to hook.hpp and hook/manager.hpp(cpp) should be relatively 
straightforward.

For changes to the {{Master}} class, I took a quick look and there seem to be 
two different paths:

* Performing the non-blocking hook before `Master::_accept`
*  Pro:
* Can be done in parallel with authorization (the other non-blocking 
thing in the operation);
* Simpler handling for sending messages to the slave: because everything 
will be ready in {{Master::_accept}}, we can still send the corresponding 
messages to the slave (`CheckpointMessage` for RESERVE/UNRESERVE/..., `RunTask` 
or `RunTaskGroup` for running a task/task group, etc.).
* Con:
* Task validation and authorization are not performed yet, so hooks 
could see tasks which never get launched
* technically that is always possible if the agent disconnects/goes down, 
or the `send(slave->pid, message);` gets dropped. Frameworks are reliably told 
the task status, but hooks are not.
* More thoughts:
* Maybe we should consider creating a private helper struct on the Master 
class to mutate `OfferOperation` (adding task labels is only one example), to 
facilitate further changes?
* Perform the hook inside `Master::_accept`
* Pro:
* We already know there is a pending task launch, so less code on this 
part;
* Con:
* To preserve the ordering of messages, we would need to change `void 
Master::_apply(...)` so that it returns a `Future` and caches it, and 
only send out all messages once everything is ready.

I'm inclined to go with the first path, but some discussion with people more 
familiar with the large master code base is definitely welcome.

Thanks!

> Convert slaveRunTaskLabelDecorator and masterRunTaskLabelDecorator to 
> non-blocking API
> --
>
> Key: MESOS-7874
> URL: https://issues.apache.org/jira/browse/MESOS-7874
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>  Labels: hooks, module
>
> Our use case: we need a non-blocking way to notify our secret management 
> system during the task launching sequence on the agent. This mechanism needs 
> to work for both {{DockerContainerizer}} and {{MesosContainerizer}}, and both 
> {{custom executor}} and {{command executor}}, with proper access to labels on 
> {{TaskInfo}}.
> As of 1.3.0, the hooks in [hook.hpp | 
> https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty 
> inconsistent across these combinations.
> The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}; however, 
> it has a couple of problems:
> 1. For DockerContainerizer + custom executor, it strips away TaskInfo and 
> sends a `None()` instead;
> 2. This hook is not called on {{MesosContainerizer}} at all. I gue

[jira] [Created] (MESOS-7899) Expose sandboxes using virtual paths and hide the agent work directory

2017-08-17 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-7899:


 Summary: Expose sandboxes using virtual paths and hide the agent 
work directory
 Key: MESOS-7899
 URL: https://issues.apache.org/jira/browse/MESOS-7899
 Project: Mesos
  Issue Type: Task
Reporter: Zhitao Li
Assignee: Zhitao Li


The {{Files}} interface already supports a virtual file system. We should 
figure out a way to enable this in the {{/files/download}} endpoint to hide the 
agent sandbox path.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7868) Support virtual filesystem in `Files` interface

2017-08-17 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130854#comment-16130854
 ] 

Zhitao Li commented on MESOS-7868:
--

I filed MESOS-7899 for the task

> Support virtual filesystem in `Files` interface
> ---
>
> Key: MESOS-7868
> URL: https://issues.apache.org/jira/browse/MESOS-7868
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Zhitao Li
>
> Based on a conversation with [~bmahler], the {{Files}} interface, which is 
> used in [/files/download | 
> http://mesos.apache.org/documentation/latest/endpoints/files/download/] and 
> other files endpoints, is intended to support virtual path lookup, so a caller 
> can simply provide something like {{//latest}} to 
> navigate and/or download files in the sandbox.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7899) Expose sandboxes using virtual paths and hide the agent work directory

2017-08-17 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130905#comment-16130905
 ] 

Zhitao Li commented on MESOS-7899:
--

[~bmahler], what's your suggestion on what the virtual path should be?

One idea I have is to use a relative path 
{{frameworks/<framework_id>/executors/<executor_id>/latest}}. By omitting the 
root directory, the files API on the agent will default to browsing in 
{{<work_dir>/slaves/<agent_id>}}.

The alternative is to provide a fake {{/latest-agent-work-dir}} and mount the 
executor directory there. However, endpoint/API users still need to know what 
this fake path is to use it properly.

I prefer the relative path idea but want to go over it with you before 
starting. Thanks.
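For illustration only, here is a tiny sketch of composing the relative virtual path. The {{<framework_id>}}/{{<executor_id>}} placeholder names are my assumption about the intended format:

```cpp
#include <iostream>
#include <string>

// Build the proposed relative virtual path; omitting the root is what lets
// the files API resolve it under the agent's own work directory layout, so
// callers never need to know work_dir or agent_id.
std::string virtualSandboxPath(const std::string& frameworkId,
                               const std::string& executorId)
{
  return "frameworks/" + frameworkId + "/executors/" + executorId + "/latest";
}

int main()
{
  // e.g. this string would be passed as the path to the files endpoints.
  std::cout << virtualSandboxPath("5d030fd5-0014", "node-29_executor") << "\n";
  return 0;
}
```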

> Expose sandboxes using virtual paths and hide the agent work directory
> --
>
> Key: MESOS-7899
> URL: https://issues.apache.org/jira/browse/MESOS-7899
> Project: Mesos
>  Issue Type: Task
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> The {{Files}} interface already supports a virtual file system. We should 
> figure out a way to enable this in the {{/files/download}} endpoint to hide 
> the agent sandbox path.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7899) Expose sandboxes using virtual paths and hide the agent work directory

2017-09-01 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-7899:
-
Shepherd: Benjamin Mahler

> Expose sandboxes using virtual paths and hide the agent work directory
> --
>
> Key: MESOS-7899
> URL: https://issues.apache.org/jira/browse/MESOS-7899
> Project: Mesos
>  Issue Type: Task
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> The {{Files}} interface already supports a virtual file system. We should 
> figure out a way to enable this in the {{/files/download}} endpoint to hide 
> the agent sandbox path.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-5893) mesos-executor should adopt and reap orphan child processes

2017-09-06 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155588#comment-16155588
 ] 

Zhitao Li commented on MESOS-5893:
--

Is this problem still there, [~jieyu]?

> mesos-executor should adopt and reap orphan child processes
> ---
>
> Key: MESOS-5893
> URL: https://issues.apache.org/jira/browse/MESOS-5893
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.1.0
> Environment: mesos compiled from git master ( 1.1.0 ) 
> {{../configure --enable-ssl --enable-libevent --prefix=/usr --enable-optimize 
> --enable-silent-rules --enable-xfs-disk-isolator}}
> isolators : 
> {{namespaces/pid,cgroups/cpu,cgroups/mem,filesystem/linux,docker/runtime,network/cni,docker/volume}}
>Reporter: Stéphane Cottin
>  Labels: containerizer
>
> The mesos containerizer does not properly handle child process death.
> Discovered using marathon-lb: each topology update forks another haproxy; the 
> old haproxy process should properly die after its last client connection is 
> terminated, but it turns into a zombie instead.
> {noformat}
>  7716 ?Ssl0:00  |   \_ mesos-executor 
> --launcher_dir=/usr/libexec/mesos --sandbox_directory=/mnt/mesos/sandbox 
> --user=root --working_directory=/marathon-lb 
> --rootfs=/mnt/mesos/provisioner/containers/3b381d5c-7490-4dcd-ab4b-81051226075a/backends/overlay/rootfses/a4beacac-2d7e-445b-80c8-a9b4e480c491
>  7813 ?Ss 0:00  |   |   \_ sh -c /marathon-lb/run sse 
> --marathon https://marathon:8443 --auth-credentials user:pass --group 
> 'external' --ssl-certs /certs --max-serv-port-ip-per-task 20050
>  7823 ?S  0:00  |   |   |   \_ /bin/bash /marathon-lb/run sse 
> --marathon https://marathon:8443 --auth-credentials user:pass --group 
> external --ssl-certs /certs --max-serv-port-ip-per-task 20050
>  7827 ?S  0:00  |   |   |   \_ /usr/bin/runsv 
> /marathon-lb/service/haproxy
>  7829 ?S  0:00  |   |   |   |   \_ /bin/bash ./run
>  8879 ?S  0:00  |   |   |   |   \_ sleep 0.5
>  7828 ?Sl 0:00  |   |   |   \_ python3 
> /marathon-lb/marathon_lb.py --syslog-socket /dev/null --haproxy-config 
> /marathon-lb/haproxy.cfg --ssl-certs /certs --command sv reload 
> /marathon-lb/service/haproxy --sse --marathon https://marathon:8443 
> --auth-credentials user:pass --group external --max-serv-port-ip-per-task 
> 20050
>  7906 ?Zs 0:00  |   |   \_ [haproxy] 
>  8628 ?Zs 0:00  |   |   \_ [haproxy] 
>  8722 ?Ss 0:00  |   |   \_ haproxy -p /tmp/haproxy.pid -f 
> /marathon-lb/haproxy.cfg -D -sf 144 52
> {noformat}
> update: mesos-executor should be registered as a subreaper ( 
> http://man7.org/linux/man-pages/man2/prctl.2.html ) and propagate signals. 
> code sample: https://github.com/krallin/tini/blob/master/src/tini.c



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-5582) Create a `cgroups/devices` isolator.

2017-09-07 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16157331#comment-16157331
 ] 

Zhitao Li commented on MESOS-5582:
--

Can this be closed already?

> Create a `cgroups/devices` isolator.
> 
>
> Key: MESOS-5582
> URL: https://issues.apache.org/jira/browse/MESOS-5582
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: gpu, isolator, mesosphere
>
> Currently, all the logic for the `cgroups/devices` isolator is bundled into 
> the Nvidia GPU Isolator. We should abstract it out into its own component 
> and remove the redundant logic from the Nvidia GPU Isolator. Assuming the 
> guaranteed ordering between isolators from MESOS-5581, we can be sure that 
> the dependency order between the `cgroups/devices` and `gpu/nvidia` isolators 
> is met.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7960) Deprecate non virtual path browse/read for sandbox

2017-09-11 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-7960:


 Summary: Deprecate non virtual path browse/read for sandbox
 Key: MESOS-7960
 URL: https://issues.apache.org/jira/browse/MESOS-7960
 Project: Mesos
  Issue Type: Improvement
Reporter: Zhitao Li
Priority: Minor


We added support for browsing and reading files in an executor's latest sandbox 
run directory in MESOS-7899. We should remove support for the physical path 
after Mesos 2.0 because it requires the {{work_dir}} and {{agent_id}}, which do 
not need to be exposed to frameworks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7366) Agent sandbox gc could accidentally delete the entire persistent volume content

2017-09-20 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174009#comment-16174009
 ] 

Zhitao Li commented on MESOS-7366:
--

[~jieyu], sorry for reviving this task, but we might have missed a case for 
{{unmount} in linux.cpp. [This unmount call 
|https://github.com/apache/mesos/blob/6f98b8d6d149c5497d16f588c683a68fccba4fc9/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L489]
 can still fail if device is busy.


> Agent sandbox gc could accidentally delete the entire persistent volume 
> content
> ---
>
> Key: MESOS-7366
> URL: https://issues.apache.org/jira/browse/MESOS-7366
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>Reporter: Zhitao Li
>Assignee: Jie Yu
>Priority: Blocker
> Fix For: 1.0.4, 1.1.2, 1.2.1
>
>
> When 1) a persistent volume is mounted, 2) umount is stuck or something, and 
> 3) executor directory gc is invoked, the agent seems to emit a log like:
> ```
>  Failed to delete directory  /runs//volume: Device or 
> resource busy
> ```
> After this, the persistent volume directory is empty.
> This could trigger data loss on critical workloads, so we should fix this ASAP.
> The triggering environment is a custom executor w/o a rootfs image.
> Please let me know if you need more signal.
> {noformat}
> I0407 15:18:22.752624 22758 paths.cpp:536] Trying to chown 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377'
>  to user 'uber'
> I0407 15:18:22.763229 22758 slave.cpp:6179] Launching executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 with resources 
> cpus(cassandra-cstar-location-store, cassandra, {resource_id: 
> 29e2ac63-d605-4982-a463-fa311be94e0a}):0.1; 
> mem(cassandra-cstar-location-store, cassandra, {resource_id: 
> 2e1223f3-41a2-419f-85cc-cbc839c19c70}):768; 
> ports(cassandra-cstar-location-store, cassandra, {resource_id: 
> fdd6598f-f32b-4c90-a622-226684528139}):[31001-31001] in work directory 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377'
> I0407 15:18:22.764103 22758 slave.cpp:1987] Queued task 
> 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' for executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014
> I0407 15:18:22.766253 22764 containerizer.cpp:943] Starting container 
> d5a56564-3e24-4c60-9919-746710b78377 for executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014
> I0407 15:18:22.767514 22766 linux.cpp:730] Mounting 
> '/var/lib/mesos/volumes/roles/cassandra-cstar-location-store/d6290423-2ba4-4975-86f4-ffd84ad138ff'
>  to 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume'
>  for persistent volume disk(cassandra-cstar-location-store, cassandra, 
> {resource_id: 
> fefc15d6-0c6f-4eac-a3f8-c34d0335c5ec})[d6290423-2ba4-4975-86f4-ffd84ad138ff:volume]:6466445
>  of container d5a56564-3e24-4c60-9919-746710b78377
> I0407 15:18:22.894340 22768 containerizer.cpp:1494] Checkpointing container's 
> forked pid 6892 to 
> '/var/lib/mesos/meta/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/pids/forked.pid'
> I0407 15:19:01.011916 22749 slave.cpp:3231] Got registration for executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 from executor(1)@10.14.6.132:36837
> I0407 15:19:01.031939 22770 slave.cpp:2191] Sending queued task 
> 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' to executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 at executor(1)@10.14.6.132:36837
> I0407 15:26:14.012861 22749 linux.cpp:627] Removing mount 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/fra
> meworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a5656
> 4-3e24-4c60-9919-746710b78377/volume' for persistent volume 
> disk(cassandra-cstar-loca

[jira] [Comment Edited] (MESOS-7366) Agent sandbox gc could accidentally delete the entire persistent volume content

2017-09-20 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174009#comment-16174009
 ] 

Zhitao Li edited comment on MESOS-7366 at 9/20/17 11:45 PM:


[~jieyu], sorry for reviving this task, but we might have missed a case for 
*unmount* in linux.cpp. [This unmount call 
|https://github.com/apache/mesos/blob/6f98b8d6d149c5497d16f588c683a68fccba4fc9/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L489]
 can still fail if device is busy.



was (Author: zhitao):
[~jieyu], sorry for reviving this task, but we might have missed a case for 
{{unmount} in linux.cpp. [This unmount call 
|https://github.com/apache/mesos/blob/6f98b8d6d149c5497d16f588c683a68fccba4fc9/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L489]
 can still fail if device is busy.


> Agent sandbox gc could accidentally delete the entire persistent volume 
> content
> ---
>
> Key: MESOS-7366
> URL: https://issues.apache.org/jira/browse/MESOS-7366
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>Reporter: Zhitao Li
>Assignee: Jie Yu
>Priority: Blocker
> Fix For: 1.0.4, 1.1.2, 1.2.1
>
>
> When 1) a persistent volume is mounted, 2) umount is stuck or something, and 
> 3) executor directory gc is invoked, the agent seems to emit a log like:
> ```
>  Failed to delete directory  /runs//volume: Device or 
> resource busy
> ```
> After this, the persistent volume directory is empty.
> This could trigger data loss on critical workloads, so we should fix this ASAP.
> The triggering environment is a custom executor w/o a rootfs image.
> Please let me know if you need more signal.
> {noformat}
> I0407 15:18:22.752624 22758 paths.cpp:536] Trying to chown 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377'
>  to user 'uber'
> I0407 15:18:22.763229 22758 slave.cpp:6179] Launching executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 with resources 
> cpus(cassandra-cstar-location-store, cassandra, {resource_id: 
> 29e2ac63-d605-4982-a463-fa311be94e0a}):0.1; 
> mem(cassandra-cstar-location-store, cassandra, {resource_id: 
> 2e1223f3-41a2-419f-85cc-cbc839c19c70}):768; 
> ports(cassandra-cstar-location-store, cassandra, {resource_id: 
> fdd6598f-f32b-4c90-a622-226684528139}):[31001-31001] in work directory 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377'
> I0407 15:18:22.764103 22758 slave.cpp:1987] Queued task 
> 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' for executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014
> I0407 15:18:22.766253 22764 containerizer.cpp:943] Starting container 
> d5a56564-3e24-4c60-9919-746710b78377 for executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014
> I0407 15:18:22.767514 22766 linux.cpp:730] Mounting 
> '/var/lib/mesos/volumes/roles/cassandra-cstar-location-store/d6290423-2ba4-4975-86f4-ffd84ad138ff'
>  to 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume'
>  for persistent volume disk(cassandra-cstar-location-store, cassandra, 
> {resource_id: 
> fefc15d6-0c6f-4eac-a3f8-c34d0335c5ec})[d6290423-2ba4-4975-86f4-ffd84ad138ff:volume]:6466445
>  of container d5a56564-3e24-4c60-9919-746710b78377
> I0407 15:18:22.894340 22768 containerizer.cpp:1494] Checkpointing container's 
> forked pid 6892 to 
> '/var/lib/mesos/meta/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/pids/forked.pid'
> I0407 15:19:01.011916 22749 slave.cpp:3231] Got registration for executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 from executor(1)@10.14.6.132:36837
> I0407 15:19:01.031939 22770 slave.cpp:2191] Sending queued task 
> 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' to executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 at execu

[jira] [Commented] (MESOS-1739) Allow slave reconfiguration on restart

2017-09-22 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16177221#comment-16177221
 ] 

Zhitao Li commented on MESOS-1739:
--

Ping on this too. I'm willing to work on this in the next couple of months and 
push to make it happen.

> Allow slave reconfiguration on restart
> --
>
> Key: MESOS-1739
> URL: https://issues.apache.org/jira/browse/MESOS-1739
> Project: Mesos
>  Issue Type: Epic
>Reporter: Patrick Reilly
>  Labels: external-volumes, mesosphere, myriad
>
> Make it so that either via a slave restart or an out-of-process "reconfigure" 
> ping, the attributes and resources of a slave can be updated to be a superset 
> of what they used to be.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8018) Allow framework to opt-in to forward executor's JWT token to the tasks

2017-09-26 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8018:


 Summary: Allow framework to opt-in to forward executor's JWT token 
to the tasks
 Key: MESOS-8018
 URL: https://issues.apache.org/jira/browse/MESOS-8018
 Project: Mesos
  Issue Type: Improvement
Reporter: Zhitao Li


The nested container API is an awesome feature and has enabled a lot of 
interesting use cases. A pattern we have seen multiple times is that a task 
(often the only one) launched by the default executor wants to further create 
containers nested beneath itself (or the executor) to run some different 
workload.

Because the entire request is 1) completely local to the executor container and 
2) okay to be bounded by the executor's lifecycle, we'd like to allow the task 
to use the Mesos agent API directly to create these nested containers. However, 
this creates a problem when we want to enable HTTP executor authentication, 
because the JWT auth token is only available to the executor, so the task's API 
request will be rejected.

Requiring the framework owner to fork or create a custom executor simply for 
this purpose also seems a bit too heavy.

My proposal is to allow the framework to opt in via some field so that the 
launched task receives certain environment variables from the default executor, 
so the task can "act upon" the executor. One idea is to add a new field that 
allows certain environment variables to be forwarded from the executor to the 
task.
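A minimal sketch of the opt-in idea in standard C++ (the whitelist parameter plays the role of the hypothetical opt-in field, and the token variable name is only an example, not a confirmed API):

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>

using Environment = std::map<std::string, std::string>;

// Copy only the explicitly forwarded variables from the executor's
// environment into the task's environment. `forwardedVariables` plays the
// role of the hypothetical opt-in field on the framework/task.
Environment buildTaskEnvironment(
    const Environment& executorEnv,
    const std::set<std::string>& forwardedVariables)
{
  Environment taskEnv;
  for (const std::string& name : forwardedVariables) {
    auto it = executorEnv.find(name);
    if (it != executorEnv.end()) {
      taskEnv[name] = it->second;
    }
  }
  return taskEnv;
}

int main()
{
  Environment executorEnv = {
    {"MESOS_EXECUTOR_AUTHENTICATION_TOKEN", "example-token"},  // example name
    {"MESOS_SANDBOX", "/mnt/mesos/sandbox"}
  };

  Environment taskEnv = buildTaskEnvironment(
      executorEnv, {"MESOS_EXECUTOR_AUTHENTICATION_TOKEN"});

  std::cout << "forwarded " << taskEnv.size() << " variable(s)\n";
  return 0;
}
```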



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8018) Allow framework to opt-in to forward executor's JWT token to the tasks

2017-09-28 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184562#comment-16184562
 ] 

Zhitao Li commented on MESOS-8018:
--

[~jamespeach] If the framework *opts in* to this behavior, then the task will 
be allowed to do whatever the (default) executor can do through the agent HTTP 
API, possibly including launching a privileged task within the executor 
container tree.

> Allow framework to opt-in to forward executor's JWT token to the tasks
> --
>
> Key: MESOS-8018
> URL: https://issues.apache.org/jira/browse/MESOS-8018
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>
> The nested container API is an awesome feature and has enabled a lot of 
> interesting use cases. A pattern we have seen multiple times is that a task 
> (often the only one) launched by the default executor wants to further create 
> containers nested beneath itself (or the executor) to run some different 
> workload.
> Because the entire request is 1) completely local to the executor container 
> and 2) okay to be bounded by the executor's lifecycle, we'd like to allow the 
> task to use the Mesos agent API directly to create these nested containers. 
> However, this creates a problem when we want to enable HTTP executor 
> authentication, because the JWT auth token is only available to the executor, 
> so the task's API request will be rejected.
> Requiring the framework owner to fork or create a custom executor simply for 
> this purpose also seems a bit too heavy.
> My proposal is to allow the framework to opt in via some field so that the 
> launched task receives certain environment variables from the default 
> executor, so the task can "act upon" the executor. One idea is to add a new 
> field that allows certain environment variables to be forwarded from the 
> executor to the task.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8018) Allow framework to opt-in to forward executor's JWT token to the tasks

2017-09-28 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184562#comment-16184562
 ] 

Zhitao Li edited comment on MESOS-8018 at 9/28/17 6:00 PM:
---

[~jamespeach] If the framework *opts in* to this behavior, then the task will 
be allowed to do whatever the (default) executor can do through the agent HTTP 
API, possibly including launching a privileged task within the executor 
container tree, if other parts of AuthZ permit that.


was (Author: zhitao):
[~jamespeach] If the framework *opts in* to this behavior, then the task will 
be allowed to do whatever the (default) executor can do through the agent HTTP 
API, possibly including launching a privileged task within the executor 
container tree.

> Allow framework to opt-in to forward executor's JWT token to the tasks
> --
>
> Key: MESOS-8018
> URL: https://issues.apache.org/jira/browse/MESOS-8018
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>
> The nested container API is an awesome feature and has enabled a lot of 
> interesting use cases. A pattern we have seen multiple times is that a task 
> (often the only one) launched by the default executor wants to further create 
> containers nested beneath itself (or the executor) to run some different 
> workload.
> Because the entire request is 1) completely local to the executor container 
> and 2) okay to be bounded by the executor's lifecycle, we'd like to allow the 
> task to use the Mesos agent API directly to create these nested containers. 
> However, this creates a problem when we want to enable HTTP executor 
> authentication, because the JWT auth token is only available to the executor, 
> so the task's API request will be rejected.
> Requiring the framework owner to fork or create a custom executor simply for 
> this purpose also seems a bit too heavy.
> My proposal is to allow the framework to opt in via some field so that the 
> launched task receives certain environment variables from the default 
> executor, so the task can "act upon" the executor. One idea is to add a new 
> field that allows certain environment variables to be forwarded from the 
> executor to the task.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8018) Allow framework to opt-in to forward executor's JWT token to the tasks

2017-09-28 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184562#comment-16184562
 ] 

Zhitao Li edited comment on MESOS-8018 at 9/28/17 6:01 PM:
---

[~jamespeach] If the framework *opts in* to this behavior, then the task will 
be allowed to do whatever the (default) executor can do through the agent HTTP 
API, possibly including launching a privileged task within the executor 
container tree, if other parts of AuthZ permit that.

My rationale here is pretty much to intentionally treat this task as an 
extension of the executor. I'd argue this is simpler than forcing everyone to 
write an executor.


was (Author: zhitao):
[~jamespeach] If the framework *opts in* to this behavior, then the task will 
be allowed to do whatever the (default) executor can do through the agent HTTP 
API, possibly including launching a privileged task within the executor 
container tree, if other parts of AuthZ permit that.

> Allow framework to opt-in to forward executor's JWT token to the tasks
> --
>
> Key: MESOS-8018
> URL: https://issues.apache.org/jira/browse/MESOS-8018
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>
> The nested container API is an awesome feature and has enabled a lot of 
> interesting use cases. A pattern we have seen multiple times is that a task 
> (often the only one) launched by the default executor wants to further create 
> containers nested beneath itself (or the executor) to run some different 
> workload.
> Because the entire request is 1) completely local to the executor container 
> and 2) okay to be bounded by the executor's lifecycle, we'd like to allow the 
> task to use the Mesos agent API directly to create these nested containers. 
> However, this creates a problem when we want to enable HTTP executor 
> authentication, because the JWT auth token is only available to the executor, 
> so the task's API request will be rejected.
> Requiring the framework owner to fork or create a custom executor simply for 
> this purpose also seems a bit too heavy.
> My proposal is to allow the framework to opt in via some field so that the 
> launched task receives certain environment variables from the default 
> executor, so the task can "act upon" the executor. One idea is to add a new 
> field that allows certain environment variables to be forwarded from the 
> executor to the task.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8040) Return nested containers in `GET_CONTAINERS` API call

2017-09-28 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8040:


 Summary: Return nested containers in `GET_CONTAINERS` API call
 Key: MESOS-8040
 URL: https://issues.apache.org/jira/browse/MESOS-8040
 Project: Mesos
  Issue Type: Bug
Reporter: Zhitao Li


Right now, there is no way to directly query the agent and learn all nested 
containers' IDs, parent IDs, and other information.

After talking to [~jieyu], the `GET_CONTAINERS` API seems like a good fit to 
return this information.
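To sketch the kind of information the listing would need to carry (stand-in types below, not the real protobufs), each container id would also expose its parent, so callers can reconstruct the hierarchy:

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Stand-in for a container id that optionally carries a parent link.
struct ContainerID {
  std::string value;
  std::shared_ptr<ContainerID> parent;  // unset for top-level containers
};

void printContainers(const std::vector<ContainerID>& containers)
{
  for (const ContainerID& id : containers) {
    std::cout << id.value;
    if (id.parent) {
      std::cout << " (nested under " << id.parent->value << ")";
    }
    std::cout << "\n";
  }
}

int main()
{
  ContainerID top{"d5a56564", nullptr};
  ContainerID nested{"a1b2c3d4", std::make_shared<ContainerID>(top)};
  printContainers({top, nested});
  return 0;
}
```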



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8040) Return nested containers in `GET_CONTAINERS` API call

2017-09-28 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-8040:
-
Component/s: containerization
 Issue Type: Improvement  (was: Bug)

> Return nested containers in `GET_CONTAINERS` API call
> -
>
> Key: MESOS-8040
> URL: https://issues.apache.org/jira/browse/MESOS-8040
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Zhitao Li
>
> Right now, there is no way to directly query the agent and learn all nested 
> containers' IDs, parent IDs, and other information.
> After talking to [~jieyu], the `GET_CONTAINERS` API seems like a good fit to 
> return this information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-6240) Allow executor/agent communication over non-TCP/IP stream socket.

2017-10-04 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191610#comment-16191610
 ] 

Zhitao Li commented on MESOS-6240:
--

+1

Moving the executor-to-agent API from TCP to a domain socket will also reduce 
some potential security exposure of the agent.

Is there a design doc for this work?

> Allow executor/agent communication over non-TCP/IP stream socket.
> -
>
> Key: MESOS-6240
> URL: https://issues.apache.org/jira/browse/MESOS-6240
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
> Environment: Linux and Windows
>Reporter: Avinash Sridharan
>Assignee: Benjamin Hindman
>Priority: Critical
>  Labels: mesosphere
>
> Currently, the executor agent communication happens specifically over TCP 
> sockets. This works fine in most cases, but specifically for the 
> `MesosContainerizer` when containers are running on CNI networks, this mode 
> of communication starts imposing constraints on the CNI network, since now 
> there has to be connectivity between the CNI network (on which the executor is 
> running) and the agent. Introducing paths from a CNI network to the 
> underlying agent, at best, creates headaches for operators and at worst 
> introduces serious security holes in the network, since it is breaking the 
> isolation between the container CNI network and the host network (on which 
> the agent is running).
> In order to simplify/strengthen deployment of Mesos containers on CNI 
> networks we therefore need to move away from using TCP/IP sockets for 
> executor/agent communication. Since, executor and agent are guaranteed to run 
> on the same host, the above problems can be resolved if, for the 
> `MesosContainerizer`, we use UNIX domain sockets or named pipes instead of 
> TCP/IP sockets for the executor/agent communication.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8070) Bundled GRPC build does not build on Debian 8

2017-10-10 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8070:


 Summary: Bundled GRPC build does not build on Debian 8
 Key: MESOS-8070
 URL: https://issues.apache.org/jira/browse/MESOS-8070
 Project: Mesos
  Issue Type: Bug
Reporter: Zhitao Li
Assignee: Chun-Hung Hsiao


Debian 8 includes an outdated version of libc-ares-dev, which prevents the 
bundled gRPC from building.

I believe [~chhsia0] already has a fix.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8075) Add RWMutex to libprocess

2017-10-11 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8075:


 Summary: Add RWMutex to libprocess
 Key: MESOS-8075
 URL: https://issues.apache.org/jira/browse/MESOS-8075
 Project: Mesos
  Issue Type: Task
  Components: libprocess
Reporter: Zhitao Li


We want to add a new {{RWMutex}} similar to {{Mutex}}, which can provide better 
concurrency protection for mutually exclusive actions, but allow high 
concurrency for actions which can be performed at the same time.

One use case is image garbage collection: the new API 
{{provisioner::pruneImages}} needs to be mutually exclusive with 
{{provisioner::provision}}, but multiple {{provisioner::provision}} calls can 
safely run concurrently.
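To illustrate the intended semantics only, here is a blocking stand-in using {{std::shared_mutex}}; the real libprocess primitive would be future-based and non-blocking, and the class below is not the actual provisioner:

```cpp
#include <mutex>
#include <shared_mutex>

class ProvisionerSketch
{
public:
  void provision()
  {
    // Shared lock: many provision() calls may run at the same time.
    std::shared_lock<std::shared_mutex> lock(mutex_);
    // ... provision a container rootfs from cached image layers ...
  }

  void pruneImages()
  {
    // Exclusive lock: no provision() may be in flight while pruning layers.
    std::unique_lock<std::shared_mutex> lock(mutex_);
    // ... delete unused image layers ...
  }

private:
  std::shared_mutex mutex_;
};

int main()
{
  ProvisionerSketch provisioner;
  provisioner.provision();
  provisioner.pruneImages();
  return 0;
}
```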



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8075) Add RWMutex to libprocess

2017-10-11 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li reassigned MESOS-8075:


Assignee: Zhitao Li

> Add RWMutex to libprocess
> -
>
> Key: MESOS-8075
> URL: https://issues.apache.org/jira/browse/MESOS-8075
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> We want to add a new {{RWMutex}} similar to {{Mutex}}, which can provide 
> better concurrency protection for mutually exclusive actions, but allow high 
> concurrency for actions which can be performed at the same time.
> One use case is image garbage collection: the new API 
> {{provisioner::pruneImages}} needs to be mutually exclusive with 
> {{provisioner::provision}}, but multiple {{provisioner::provision}} calls can 
> safely run concurrently.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8079) Checkpoint and recover layers used to provision rootfs in provisioner

2017-10-12 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8079:


 Summary: Checkpoint and recover layers used to provision rootfs in 
provisioner
 Key: MESOS-8079
 URL: https://issues.apache.org/jira/browse/MESOS-8079
 Project: Mesos
  Issue Type: Task
  Components: provisioner
Reporter: Zhitao Li


This information will be necessary for the {{provisioner}} to determine all 
layers used by active containers, which we need to retain when image GC happens.
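A minimal sketch of the checkpoint/recover shape; the file format and path handling here are invented for illustration, not the provisioner's actual layout:

```cpp
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Write one layer id per line so the set of layers backing a container's
// rootfs survives an agent restart.
void checkpointLayers(const std::string& path,
                      const std::vector<std::string>& layerIds)
{
  std::ofstream out(path);
  for (const std::string& id : layerIds) {
    out << id << "\n";
  }
}

// Read the checkpoint back; these layers must be retained by image GC as
// long as the container is active.
std::vector<std::string> recoverLayers(const std::string& path)
{
  std::vector<std::string> layerIds;
  std::ifstream in(path);
  for (std::string line; std::getline(in, line);) {
    if (!line.empty()) {
      layerIds.push_back(line);
    }
  }
  return layerIds;
}

int main()
{
  const std::string checkpoint = "/tmp/layers.checkpoint";  // example path
  checkpointLayers(checkpoint, {"sha256:aaaa", "sha256:bbbb"});
  std::cout << recoverLayers(checkpoint).size() << " layers recovered\n";
  return 0;
}
```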



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8075) Add RWMutex to libprocess

2017-10-12 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-8075:
-
Shepherd: Benjamin Hindman

> Add RWMutex to libprocess
> -
>
> Key: MESOS-8075
> URL: https://issues.apache.org/jira/browse/MESOS-8075
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> We want to add a new {{RWMutex}} similar to {{Mutex}}, which can provide 
> better concurrency protection for mutually exclusive actions, but allow high 
> concurrency for actions which can be performed at the same time.
> One use case is image garbage collection: the new API 
> {{provisioner::pruneImages}} needs to be mutually exclusive with 
> {{provisioner::provision}}, but multiple {{provisioner::provision}} calls can 
> safely run concurrently.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8090) Mesos 1.4.0 crashes with 1.3.x agent with oversubscription

2017-10-13 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-8090:
-
Affects Version/s: 1.4.0

> Mesos 1.4.0 crashes with 1.3.x agent with oversubscription
> --
>
> Key: MESOS-8090
> URL: https://issues.apache.org/jira/browse/MESOS-8090
> Project: Mesos
>  Issue Type: Bug
>  Components: master, oversubscription
>Affects Versions: 1.4.0
>Reporter: Zhitao Li
>Assignee: Michael Park
>
> We are seeing a crash in the 1.4.0 master when it receives {{updateSlave}} 
> from an oversubscription-enabled agent running 1.3.1 code.
> The crash line is:
> resources.cpp:1050] Check failed: !resource.has_role() cpus{REV}:19
> Stack trace in gdb:
> {panel:title=My title}
> #0  0x7f22f3553067 in __GI_raise (sig=sig@entry=6) at 
> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> #1  0x7f22f3554448 in __GI_abort () at abort.c:89
> #2  0x7f22f615cd79 in google::DumpStackTraceAndExit () at 
> src/utilities.cc:147
> #3  0x7f22f6154a4d in google::LogMessage::Fail () at src/logging.cc:1458
> #4  0x7f22f61566cd in google::LogMessage::SendToLog (this= out>) at src/logging.cc:1412
> #5  0x7f22f6154612 in google::LogMessage::Flush (this=0x18ac7) at 
> src/logging.cc:1281
> #6  0x7f22f61570b9 in google::LogMessageFatal::~LogMessageFatal 
> (this=, __in_chrg=) at src/logging.cc:1984
> #7  0x7f22f527e133 in mesos::Resources::isEmpty (resource=...) at 
> /mesos/src/common/resources.cpp:1051
> #8  0x7f22f527e1e5 in mesos::Resources::Resource_::isEmpty 
> (this=this@entry=0x7f22e713d2e0) at /mesos/src/common/resources.cpp:1173
> #9  0x7f22f527e20c in mesos::Resources::add (this=0x7f22e713d400, 
> that=...) at /mesos/src/common/resources.cpp:1993
> #10 0x7f22f527f860 in mesos::Resources::operator+= 
> (this=this@entry=0x7f22e713d400, that=...) at 
> /mesos/src/common/resources.cpp:2016
> #11 0x7f22f527f91d in mesos::Resources::operator+= 
> (this=this@entry=0x7f22e713d400, that=...) at 
> /mesos/src/common/resources.cpp:2025
> #12 0x7f22f527fa4b in mesos::Resources::Resources (this=0x7f22e713d400, 
> _resources=...) at /mesos/src/common/resources.cpp:1277
> #13 0x7f22f548b812 in mesos::internal::master::Master::updateSlave 
> (this=0x558137bbae70, message=...) at /mesos/src/master/master.cpp:6681
> #14 0x7f22f550adc1 in 
> ProtobufProcess::_handlerM
>  (t=0x558137bbae70, method=
> (void 
> (mesos::internal::master::Master::*)(mesos::internal::master::Master * const, 
> const mesos::internal::UpdateSlaveMessage &)) 0x7f22f548b6d0 
>   const&)>, 
> data="\n)\n'07ba28cc-d9fa-44fb-8d6b-f8c5c90f8a90-S1\022\030\n\004cpus\020\000\032\t\t\000\000\000\000\000\000\063@2\001*J")
> at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:799
> #15 0x7f22f54c8791 in 
> ProtobufProcess::visit (this=0x558137bbae70, 
> event=...) at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:104
> #16 0x7f22f54572d4 in mesos::internal::master::Master::_visit 
> (this=this@entry=0x558137bbae70, event=...) at 
> /mesos/src/master/master.cpp:1643
> #17 0x7f22f547014d in mesos::internal::master::Master::visit 
> (this=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1575
> #18 0x7f22f60b7169 in serve (event=..., this=0x558137bbbf28) at 
> /mesos/3rdparty/libprocess/include/process/process.hpp:87
> #19 process::ProcessManager::resume (this=, 
> process=0x558137bbbf28) at /mesos/3rdparty/libprocess/src/process.cpp:3346
> #20 0x7f22f60bd056 in operator() (__closure=0x558137aa3218) at 
> /mesos/3rdparty/libprocess/src/process.cpp:2881
> #21 _M_invoke<> (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1700
> #22 operator() (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1688
> #23 
> std::thread::_Impl()>
>  >::_M_run(void) (this=0x558137aa3200) at /usr/include/c++/4.9/thread:115
> #24 0x7f22f40b3970 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #25 0x7f22f38d1064 in start_thread (arg=0x7f22e713e700) at 
> pthread_create.c:309
> #26 0x7f22f360662d in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> {panel}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8090) Mesos 1.4.0 crashes with 1.3.x agent with oversubscription

2017-10-13 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8090:


 Summary: Mesos 1.4.0 crashes with 1.3.x agent with oversubscription
 Key: MESOS-8090
 URL: https://issues.apache.org/jira/browse/MESOS-8090
 Project: Mesos
  Issue Type: Bug
  Components: master, oversubscription
Reporter: Zhitao Li
Assignee: Michael Park


We are seeing a crash in the 1.4.0 master when it receives {{updateSlave}} from 
an oversubscription-enabled agent running 1.3.1 code.

The crash line is:

resources.cpp:1050] Check failed: !resource.has_role() cpus{REV}:19

Stack trace in gdb:

{panel:title=My title}
#0  0x7f22f3553067 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x7f22f3554448 in __GI_abort () at abort.c:89
#2  0x7f22f615cd79 in google::DumpStackTraceAndExit () at 
src/utilities.cc:147
#3  0x7f22f6154a4d in google::LogMessage::Fail () at src/logging.cc:1458
#4  0x7f22f61566cd in google::LogMessage::SendToLog (this=) 
at src/logging.cc:1412
#5  0x7f22f6154612 in google::LogMessage::Flush (this=0x18ac7) at 
src/logging.cc:1281
#6  0x7f22f61570b9 in google::LogMessageFatal::~LogMessageFatal 
(this=, __in_chrg=) at src/logging.cc:1984
#7  0x7f22f527e133 in mesos::Resources::isEmpty (resource=...) at 
/mesos/src/common/resources.cpp:1051
#8  0x7f22f527e1e5 in mesos::Resources::Resource_::isEmpty 
(this=this@entry=0x7f22e713d2e0) at /mesos/src/common/resources.cpp:1173
#9  0x7f22f527e20c in mesos::Resources::add (this=0x7f22e713d400, that=...) 
at /mesos/src/common/resources.cpp:1993
#10 0x7f22f527f860 in mesos::Resources::operator+= 
(this=this@entry=0x7f22e713d400, that=...) at 
/mesos/src/common/resources.cpp:2016
#11 0x7f22f527f91d in mesos::Resources::operator+= 
(this=this@entry=0x7f22e713d400, that=...) at 
/mesos/src/common/resources.cpp:2025
#12 0x7f22f527fa4b in mesos::Resources::Resources (this=0x7f22e713d400, 
_resources=...) at /mesos/src/common/resources.cpp:1277
#13 0x7f22f548b812 in mesos::internal::master::Master::updateSlave 
(this=0x558137bbae70, message=...) at /mesos/src/master/master.cpp:6681
#14 0x7f22f550adc1 in 
ProtobufProcess::_handlerM
 (t=0x558137bbae70, method=
(void (mesos::internal::master::Master::*)(mesos::internal::master::Master 
* const, const mesos::internal::UpdateSlaveMessage &)) 0x7f22f548b6d0 
, 
data="\n)\n'07ba28cc-d9fa-44fb-8d6b-f8c5c90f8a90-S1\022\030\n\004cpus\020\000\032\t\t\000\000\000\000\000\000\063@2\001*J")
at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:799
#15 0x7f22f54c8791 in 
ProtobufProcess::visit (this=0x558137bbae70, 
event=...) at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:104
#16 0x7f22f54572d4 in mesos::internal::master::Master::_visit 
(this=this@entry=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1643
#17 0x7f22f547014d in mesos::internal::master::Master::visit 
(this=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1575
#18 0x7f22f60b7169 in serve (event=..., this=0x558137bbbf28) at 
/mesos/3rdparty/libprocess/include/process/process.hpp:87
#19 process::ProcessManager::resume (this=, 
process=0x558137bbbf28) at /mesos/3rdparty/libprocess/src/process.cpp:3346
#20 0x7f22f60bd056 in operator() (__closure=0x558137aa3218) at 
/mesos/3rdparty/libprocess/src/process.cpp:2881
#21 _M_invoke<> (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1700
#22 operator() (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1688
#23 
std::thread::_Impl()>
 >::_M_run(void) (this=0x558137aa3200) at /usr/include/c++/4.9/thread:115
#24 0x7f22f40b3970 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#25 0x7f22f38d1064 in start_thread (arg=0x7f22e713e700) at 
pthread_create.c:309
#26 0x7f22f360662d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
{panel}





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8090) Mesos 1.4.0 crashes with 1.3.x agent with oversubscription

2017-10-13 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-8090:
-
Description: 
We are seeing a crash in the 1.4.0 master when it receives {{updateSlave}} from 
an oversubscription-enabled agent running 1.3.1 code.

The crash line is:


{panel:title=My title}
resources.cpp:1050] Check failed: !resource.has_role() cpus{REV}:19
{panel}

Stack trace in gdb:

{panel:title=My title}
#0  0x7f22f3553067 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x7f22f3554448 in __GI_abort () at abort.c:89
#2  0x7f22f615cd79 in google::DumpStackTraceAndExit () at 
src/utilities.cc:147
#3  0x7f22f6154a4d in google::LogMessage::Fail () at src/logging.cc:1458
#4  0x7f22f61566cd in google::LogMessage::SendToLog (this=) 
at src/logging.cc:1412
#5  0x7f22f6154612 in google::LogMessage::Flush (this=0x18ac7) at 
src/logging.cc:1281
#6  0x7f22f61570b9 in google::LogMessageFatal::~LogMessageFatal 
(this=, __in_chrg=) at src/logging.cc:1984
#7  0x7f22f527e133 in mesos::Resources::isEmpty (resource=...) at 
/mesos/src/common/resources.cpp:1051
#8  0x7f22f527e1e5 in mesos::Resources::Resource_::isEmpty 
(this=this@entry=0x7f22e713d2e0) at /mesos/src/common/resources.cpp:1173
#9  0x7f22f527e20c in mesos::Resources::add (this=0x7f22e713d400, that=...) 
at /mesos/src/common/resources.cpp:1993
#10 0x7f22f527f860 in mesos::Resources::operator+= 
(this=this@entry=0x7f22e713d400, that=...) at 
/mesos/src/common/resources.cpp:2016
#11 0x7f22f527f91d in mesos::Resources::operator+= 
(this=this@entry=0x7f22e713d400, that=...) at 
/mesos/src/common/resources.cpp:2025
#12 0x7f22f527fa4b in mesos::Resources::Resources (this=0x7f22e713d400, 
_resources=...) at /mesos/src/common/resources.cpp:1277
#13 0x7f22f548b812 in mesos::internal::master::Master::updateSlave 
(this=0x558137bbae70, message=...) at /mesos/src/master/master.cpp:6681
#14 0x7f22f550adc1 in 
ProtobufProcess::_handlerM
 (t=0x558137bbae70, method=
(void (mesos::internal::master::Master::*)(mesos::internal::master::Master 
* const, const mesos::internal::UpdateSlaveMessage &)) 0x7f22f548b6d0 
, 
data="\n)\n'07ba28cc-d9fa-44fb-8d6b-f8c5c90f8a90-S1\022\030\n\004cpus\020\000\032\t\t\000\000\000\000\000\000\063@2\001*J")
at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:799
#15 0x7f22f54c8791 in 
ProtobufProcess::visit (this=0x558137bbae70, 
event=...) at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:104
#16 0x7f22f54572d4 in mesos::internal::master::Master::_visit 
(this=this@entry=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1643
#17 0x7f22f547014d in mesos::internal::master::Master::visit 
(this=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1575
#18 0x7f22f60b7169 in serve (event=..., this=0x558137bbbf28) at 
/mesos/3rdparty/libprocess/include/process/process.hpp:87
#19 process::ProcessManager::resume (this=, 
process=0x558137bbbf28) at /mesos/3rdparty/libprocess/src/process.cpp:3346
#20 0x7f22f60bd056 in operator() (__closure=0x558137aa3218) at 
/mesos/3rdparty/libprocess/src/process.cpp:2881
#21 _M_invoke<> (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1700
#22 operator() (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1688
#23 
std::thread::_Impl()>
 >::_M_run(void) (this=0x558137aa3200) at /usr/include/c++/4.9/thread:115
#24 0x7f22f40b3970 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#25 0x7f22f38d1064 in start_thread (arg=0x7f22e713e700) at 
pthread_create.c:309
#26 0x7f22f360662d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
{panel}



  was:
We are seeing a crash in the 1.4.0 master when it receives {{updateSlave}} from an 
oversubscription-enabled agent running 1.3.1 code.

The crash line is:

resources.cpp:1050] Check failed: !resource.has_role() cpus{REV}:19

Stack trace in gdb:

{panel:title=My title}
#0  0x7f22f3553067 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x7f22f3554448 in __GI_abort () at abort.c:89
#2  0x7f22f615cd79 in google::DumpStackTraceAndExit () at 
src/utilities.cc:147
#3  0x7f22f6154a4d in google::LogMessage::Fail () at src/logging.cc:1458
#4  0x7f22f61566cd in google::LogMessage::SendToLog (this=) 
at src/logging.cc:1412
#5  0x7f22f6154612 in google::LogMessage::Flush (this=0x18ac7) at 
src/logging.cc:1281
#6  0x7f22f61570b9 in google::LogMessageFatal::~LogMessageFatal 
(this=, __in_chrg=) at src/logging.cc:1984
#7  0x7f22f527e133 in mesos::Resources::isEmpty (resource=...) at 
/mesos/src/common/resources.cpp:1051
#8  0x7f22f527e1e5 in mesos::Resources::Resource_::isEmpty 
(this=this@entry=0x7f22e713d2e0) at /mesos/src/common/resources.cpp:1173
#9  0x7f22f527e20c in mesos::Resources::add (this=0x7f22e713d400, that=...) 
at /mesos/src/common/re

[jira] [Updated] (MESOS-8090) Mesos 1.4.0 crashes with 1.3.x agent with oversubscription

2017-10-13 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-8090:
-
Description: 
We are seeing a crash in the 1.4.0 master when it receives {{updateSlave}} from an 
oversubscription-enabled agent running 1.3.1 code.

The crash line is:

{code:none}
resources.cpp:1050] Check failed: !resource.has_role() cpus{REV}:19
{code}

Stack trace in gdb:

{panel:title=My title}
#0  0x7f22f3553067 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x7f22f3554448 in __GI_abort () at abort.c:89
#2  0x7f22f615cd79 in google::DumpStackTraceAndExit () at 
src/utilities.cc:147
#3  0x7f22f6154a4d in google::LogMessage::Fail () at src/logging.cc:1458
#4  0x7f22f61566cd in google::LogMessage::SendToLog (this=) 
at src/logging.cc:1412
#5  0x7f22f6154612 in google::LogMessage::Flush (this=0x18ac7) at 
src/logging.cc:1281
#6  0x7f22f61570b9 in google::LogMessageFatal::~LogMessageFatal 
(this=, __in_chrg=) at src/logging.cc:1984
#7  0x7f22f527e133 in mesos::Resources::isEmpty (resource=...) at 
/mesos/src/common/resources.cpp:1051
#8  0x7f22f527e1e5 in mesos::Resources::Resource_::isEmpty 
(this=this@entry=0x7f22e713d2e0) at /mesos/src/common/resources.cpp:1173
#9  0x7f22f527e20c in mesos::Resources::add (this=0x7f22e713d400, that=...) 
at /mesos/src/common/resources.cpp:1993
#10 0x7f22f527f860 in mesos::Resources::operator+= 
(this=this@entry=0x7f22e713d400, that=...) at 
/mesos/src/common/resources.cpp:2016
#11 0x7f22f527f91d in mesos::Resources::operator+= 
(this=this@entry=0x7f22e713d400, that=...) at 
/mesos/src/common/resources.cpp:2025
#12 0x7f22f527fa4b in mesos::Resources::Resources (this=0x7f22e713d400, 
_resources=...) at /mesos/src/common/resources.cpp:1277
#13 0x7f22f548b812 in mesos::internal::master::Master::updateSlave 
(this=0x558137bbae70, message=...) at /mesos/src/master/master.cpp:6681
#14 0x7f22f550adc1 in 
ProtobufProcess::_handlerM
 (t=0x558137bbae70, method=
(void (mesos::internal::master::Master::*)(mesos::internal::master::Master 
* const, const mesos::internal::UpdateSlaveMessage &)) 0x7f22f548b6d0 
, 
data="\n)\n'07ba28cc-d9fa-44fb-8d6b-f8c5c90f8a90-S1\022\030\n\004cpus\020\000\032\t\t\000\000\000\000\000\000\063@2\001*J")
at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:799
#15 0x7f22f54c8791 in 
ProtobufProcess::visit (this=0x558137bbae70, 
event=...) at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:104
#16 0x7f22f54572d4 in mesos::internal::master::Master::_visit 
(this=this@entry=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1643
#17 0x7f22f547014d in mesos::internal::master::Master::visit 
(this=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1575
#18 0x7f22f60b7169 in serve (event=..., this=0x558137bbbf28) at 
/mesos/3rdparty/libprocess/include/process/process.hpp:87
#19 process::ProcessManager::resume (this=, 
process=0x558137bbbf28) at /mesos/3rdparty/libprocess/src/process.cpp:3346
#20 0x7f22f60bd056 in operator() (__closure=0x558137aa3218) at 
/mesos/3rdparty/libprocess/src/process.cpp:2881
#21 _M_invoke<> (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1700
#22 operator() (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1688
#23 
std::thread::_Impl()>
 >::_M_run(void) (this=0x558137aa3200) at /usr/include/c++/4.9/thread:115
#24 0x7f22f40b3970 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#25 0x7f22f38d1064 in start_thread (arg=0x7f22e713e700) at 
pthread_create.c:309
#26 0x7f22f360662d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
{panel}



  was:
We are seeing a crash in the 1.4.0 master when it receives {{updateSlave}} from an 
oversubscription-enabled agent running 1.3.1 code.

The crash line is:


{panel:title=My title}
resources.cpp:1050] Check failed: !resource.has_role() cpus{REV}:19
{panel}

Stack trace in gdb:

{panel:title=My title}
#0  0x7f22f3553067 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x7f22f3554448 in __GI_abort () at abort.c:89
#2  0x7f22f615cd79 in google::DumpStackTraceAndExit () at 
src/utilities.cc:147
#3  0x7f22f6154a4d in google::LogMessage::Fail () at src/logging.cc:1458
#4  0x7f22f61566cd in google::LogMessage::SendToLog (this=) 
at src/logging.cc:1412
#5  0x7f22f6154612 in google::LogMessage::Flush (this=0x18ac7) at 
src/logging.cc:1281
#6  0x7f22f61570b9 in google::LogMessageFatal::~LogMessageFatal 
(this=, __in_chrg=) at src/logging.cc:1984
#7  0x7f22f527e133 in mesos::Resources::isEmpty (resource=...) at 
/mesos/src/common/resources.cpp:1051
#8  0x7f22f527e1e5 in mesos::Resources::Resource_::isEmpty 
(this=this@entry=0x7f22e713d2e0) at /mesos/src/common/resources.cpp:1173
#9  0x7f22f527e20c in mesos::Resources::add (this=0x7f22e713d400, that=...) 
at /

[jira] [Commented] (MESOS-8090) Mesos 1.4.0 crashes with 1.3.x agent with oversubscription

2017-10-17 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208103#comment-16208103
 ] 

Zhitao Li commented on MESOS-8090:
--

A quick attempt to fix: https://reviews.apache.org/r/63084/

> Mesos 1.4.0 crashes with 1.3.x agent with oversubscription
> --
>
> Key: MESOS-8090
> URL: https://issues.apache.org/jira/browse/MESOS-8090
> Project: Mesos
>  Issue Type: Bug
>  Components: master, oversubscription
>Affects Versions: 1.4.0
>Reporter: Zhitao Li
>Assignee: Michael Park
>
> We are seeing a crash in the 1.4.0 master when it receives {{updateSlave}} from an 
> oversubscription-enabled agent running 1.3.1 code.
> The crash line is:
> {code:none}
> resources.cpp:1050] Check failed: !resource.has_role() cpus{REV}:19
> {code}
> Stack trace in gdb:
> {panel:title=My title}
> #0  0x7f22f3553067 in __GI_raise (sig=sig@entry=6) at 
> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> #1  0x7f22f3554448 in __GI_abort () at abort.c:89
> #2  0x7f22f615cd79 in google::DumpStackTraceAndExit () at 
> src/utilities.cc:147
> #3  0x7f22f6154a4d in google::LogMessage::Fail () at src/logging.cc:1458
> #4  0x7f22f61566cd in google::LogMessage::SendToLog (this= out>) at src/logging.cc:1412
> #5  0x7f22f6154612 in google::LogMessage::Flush (this=0x18ac7) at 
> src/logging.cc:1281
> #6  0x7f22f61570b9 in google::LogMessageFatal::~LogMessageFatal 
> (this=, __in_chrg=) at src/logging.cc:1984
> #7  0x7f22f527e133 in mesos::Resources::isEmpty (resource=...) at 
> /mesos/src/common/resources.cpp:1051
> #8  0x7f22f527e1e5 in mesos::Resources::Resource_::isEmpty 
> (this=this@entry=0x7f22e713d2e0) at /mesos/src/common/resources.cpp:1173
> #9  0x7f22f527e20c in mesos::Resources::add (this=0x7f22e713d400, 
> that=...) at /mesos/src/common/resources.cpp:1993
> #10 0x7f22f527f860 in mesos::Resources::operator+= 
> (this=this@entry=0x7f22e713d400, that=...) at 
> /mesos/src/common/resources.cpp:2016
> #11 0x7f22f527f91d in mesos::Resources::operator+= 
> (this=this@entry=0x7f22e713d400, that=...) at 
> /mesos/src/common/resources.cpp:2025
> #12 0x7f22f527fa4b in mesos::Resources::Resources (this=0x7f22e713d400, 
> _resources=...) at /mesos/src/common/resources.cpp:1277
> #13 0x7f22f548b812 in mesos::internal::master::Master::updateSlave 
> (this=0x558137bbae70, message=...) at /mesos/src/master/master.cpp:6681
> #14 0x7f22f550adc1 in 
> ProtobufProcess::_handlerM
>  (t=0x558137bbae70, method=
> (void 
> (mesos::internal::master::Master::*)(mesos::internal::master::Master * const, 
> const mesos::internal::UpdateSlaveMessage &)) 0x7f22f548b6d0 
>   const&)>, 
> data="\n)\n'07ba28cc-d9fa-44fb-8d6b-f8c5c90f8a90-S1\022\030\n\004cpus\020\000\032\t\t\000\000\000\000\000\000\063@2\001*J")
> at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:799
> #15 0x7f22f54c8791 in 
> ProtobufProcess::visit (this=0x558137bbae70, 
> event=...) at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:104
> #16 0x7f22f54572d4 in mesos::internal::master::Master::_visit 
> (this=this@entry=0x558137bbae70, event=...) at 
> /mesos/src/master/master.cpp:1643
> #17 0x7f22f547014d in mesos::internal::master::Master::visit 
> (this=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1575
> #18 0x7f22f60b7169 in serve (event=..., this=0x558137bbbf28) at 
> /mesos/3rdparty/libprocess/include/process/process.hpp:87
> #19 process::ProcessManager::resume (this=, 
> process=0x558137bbbf28) at /mesos/3rdparty/libprocess/src/process.cpp:3346
> #20 0x7f22f60bd056 in operator() (__closure=0x558137aa3218) at 
> /mesos/3rdparty/libprocess/src/process.cpp:2881
> #21 _M_invoke<> (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1700
> #22 operator() (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1688
> #23 
> std::thread::_Impl()>
>  >::_M_run(void) (this=0x558137aa3200) at /usr/include/c++/4.9/thread:115
> #24 0x7f22f40b3970 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #25 0x7f22f38d1064 in start_thread (arg=0x7f22e713e700) at 
> pthread_create.c:309
> #26 0x7f22f360662d in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> {panel}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8161) Potentially dangerous dangling mount when stopping task with persistent volume

2017-11-01 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8161:


 Summary: Potentially dangerous dangling mount when stopping task 
with persistent volume
 Key: MESOS-8161
 URL: https://issues.apache.org/jira/browse/MESOS-8161
 Project: Mesos
  Issue Type: Bug
Reporter: Zhitao Li
Priority: Critical


While we fixed the executor-termination case in MESOS-7366, a very similar case 
can still happen if a task with a persistent volume terminates while its 
executor is still active and [this unmount 
call|https://github.com/apache/mesos/blob/6f98b8d6d149c5497d16f588c683a68fccba4fc9/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L489]
 fails due to "device busy".

I believe that if agent gc or other operations run in the host mount 
namespace, it is possible to lose persistent volume data because of this 
dangling mount.
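
For illustration, here is a minimal C++ sketch of the kind of guard that could 
keep gc from recursing into a still-mounted volume path (the helper name and its 
placement are assumptions, not the actual agent code):

{code:cpp}
// Illustrative only: before gc recursively deletes a sandbox path, check
// whether anything under it is still a mount point, so a dangling persistent
// volume mount is never traversed.
#include <fstream>
#include <sstream>
#include <string>

bool isStillMounted(const std::string& path)
{
  std::ifstream mounts("/proc/self/mounts");
  std::string line;
  while (std::getline(mounts, line)) {
    std::istringstream fields(line);
    std::string device, mountPoint;
    fields >> device >> mountPoint;
    // NOTE: /proc escapes spaces as \040; ignored here for brevity.
    if (mountPoint == path || mountPoint.rfind(path + "/", 0) == 0) {
      return true; // Deleting under this path would touch the backing volume.
    }
  }
  return false;
}
{code}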

Agent log:

{code:none}
I1101 20:19:44.137109 102240 slave.cpp:3961] Sending acknowledgement for status 
update TASK_RUNNING (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for task 
node-1__23fa9624-4608-404f-8d6f-0235559588
8f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to 
executor(1)@10.70.142.140:36929
I1101 20:19:44.235196 102233 status_update_manager.cpp:395] Received status 
update acknowledgement (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for task 
node-1__23fa9624-4608-404f-8d6f-02355595888
f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
I1101 20:19:44.235302 102233 status_update_manager.cpp:832] Checkpointing ACK 
for status update TASK_RUNNING (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for 
task node-1__23fa9624-4608-404f-8d6f-0
2355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
I1101 20:19:59.135591 102213 slave.cpp:3634] Handling status update 
TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task 
node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db6
1f6d4-fd0f-48be-927d-14282c12301f-0014 from executor(1)@10.70.142.140:36929
I1101 20:19:59.136494 102216 status_update_manager.cpp:323] Received status 
update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task 
node-1__23fa9624-4608-404f-8d6f-02355595888f o
f framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
I1101 20:19:59.136540 102216 status_update_manager.cpp:832] Checkpointing 
UPDATE for status update TASK_RUNNING (UUID: 
c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6
f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
I1101 20:19:59.136724 102234 slave.cpp:4051] Forwarding the update TASK_RUNNING 
(UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task 
node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61
f6d4-fd0f-48be-927d-14282c12301f-0014 to master@10.162.12.31:5050
I1101 20:19:59.136867 102234 slave.cpp:3961] Sending acknowledgement for status 
update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task 
node-1__23fa9624-4608-404f-8d6f-0235559588
8f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to 
executor(1)@10.70.142.140:36929
I1101 20:20:02.010108 102223 http.cpp:277] HTTP GET for /slave(1)/flags from 
10.70.142.140:43046 with User-Agent='Python-urllib/2.7'
I1101 20:20:02.038574 102238 http.cpp:277] HTTP GET for /slave(1)/flags from 
10.70.142.140:43144 with User-Agent='Python-urllib/2.7'
I1101 20:20:02.246388 102237 slave.cpp:5044] Current disk usage 0.23%. Max 
allowed age: 6.283560425078715days
I1101 20:20:02.445312 102235 http.cpp:277] HTTP GET for /slave(1)/state.json 
from 10.70.142.140:44716 with User-Agent='Python-urllib/2.7'
I1101 20:20:02.448276 102215 http.cpp:277] HTTP GET for /slave(1)/flags from 
10.70.142.140:44732 with User-Agent='Python-urllib/2.7'
I1101 20:20:07.789482 102231 http.cpp:277] HTTP GET for /slave(1)/state.json 
from 10.70.142.140:56414 with User-Agent='filebundle-agent'
I1101 20:20:07.913359 102216 status_update_manager.cpp:395] Received status 
update acknowledgement (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task 
node-1__23fa9624-4608-404f-8d6f-02355595888
f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
I1101 20:20:07.913455 102216 status_update_manager.cpp:832] Checkpointing ACK 
for status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for 
task node-1__23fa9624-4608-404f-8d6f-0
2355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
I1101 20:20:14.135632 102231 slave.cpp:3634] Handling status update TASK_ERROR 
(UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task 
node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f
6d4-fd0f-48be-927d-14282c12301f-0014 from executor(1)@10.70.142.140:36929
E1101 20:20:14.136687 102211 slave.cpp:6736] Unexpected terminal task state 
TASK_ERROR
I1101 20:20:14.137081 102230 linux.cpp:627] Removing mount 
'/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/node-1_ex
ecutor__cbf9

[jira] [Commented] (MESOS-7366) Agent sandbox gc could accidentally delete the entire persistent volume content

2017-11-01 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235004#comment-16235004
 ] 

Zhitao Li commented on MESOS-7366:
--

I filed MESOS-8161 for the other case.

> Agent sandbox gc could accidentally delete the entire persistent volume 
> content
> ---
>
> Key: MESOS-7366
> URL: https://issues.apache.org/jira/browse/MESOS-7366
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>Reporter: Zhitao Li
>Assignee: Jie Yu
>Priority: Blocker
> Fix For: 1.0.4, 1.1.2, 1.2.1
>
>
> When 1) a persistent volume is mounted, 2) umount is stuck or something, 3) 
> executor directory gc being invoked, agent seems to emit a log like:
> ```
>  Failed to delete directory  /runs//volume: Device or 
> resource busy
> ```
> After this, the persistent volume directory is empty.
> This could trigger data loss on critical workload so we should fix this ASAP.
> The triggering environment is a custom executor w/o rootfs image.
> Please let me know if you need more signal.
> {noformat}
> I0407 15:18:22.752624 22758 paths.cpp:536] Trying to chown 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377'
>  to user 'uber'
> I0407 15:18:22.763229 22758 slave.cpp:6179] Launching executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 with resources 
> cpus(cassandra-cstar-location-store, cassandra, {resource_id: 
> 29e2ac63-d605-4982-a463-fa311be94e0a}):0.1; 
> mem(cassandra-cstar-location-store, cassandra, {resource_id: 
> 2e1223f3-41a2-419f-85cc-cbc839c19c70}):768; 
> ports(cassandra-cstar-location-store, cassandra, {resource_id: 
> fdd6598f-f32b-4c90-a622-226684528139}):[31001-31001] in work directory 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377'
> I0407 15:18:22.764103 22758 slave.cpp:1987] Queued task 
> 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' for executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014
> I0407 15:18:22.766253 22764 containerizer.cpp:943] Starting container 
> d5a56564-3e24-4c60-9919-746710b78377 for executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014
> I0407 15:18:22.767514 22766 linux.cpp:730] Mounting 
> '/var/lib/mesos/volumes/roles/cassandra-cstar-location-store/d6290423-2ba4-4975-86f4-ffd84ad138ff'
>  to 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume'
>  for persistent volume disk(cassandra-cstar-location-store, cassandra, 
> {resource_id: 
> fefc15d6-0c6f-4eac-a3f8-c34d0335c5ec})[d6290423-2ba4-4975-86f4-ffd84ad138ff:volume]:6466445
>  of container d5a56564-3e24-4c60-9919-746710b78377
> I0407 15:18:22.894340 22768 containerizer.cpp:1494] Checkpointing container's 
> forked pid 6892 to 
> '/var/lib/mesos/meta/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/pids/forked.pid'
> I0407 15:19:01.011916 22749 slave.cpp:3231] Got registration for executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 from executor(1)@10.14.6.132:36837
> I0407 15:19:01.031939 22770 slave.cpp:2191] Sending queued task 
> 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' to executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 at executor(1)@10.14.6.132:36837
> I0407 15:26:14.012861 22749 linux.cpp:627] Removing mount 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/fra
> meworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a5656
> 4-3e24-4c60-9919-746710b78377/volume' for persistent volume 
> disk(cassandra-cstar-location-store, cassandra, {resource_id: 
> fefc15d6-0c6f-4eac-a3f8-c34d0335c5ec})[d6290423-2ba4-4975-86f4-ffd84ad138ff:volume]:6466445
>  of container d5a56564-3e24-4c60-9919-746710b78377
> E0407 15:26:14.013828 22756 slave.cpp:3903] Failed to update resources for 
> 

[jira] [Assigned] (MESOS-8280) Mesos Containerizer GC should set 'layers' after checkpointing layer ids in provisioner.

2017-11-29 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li reassigned MESOS-8280:


Assignee: Zhitao Li

> Mesos Containerizer GC should set 'layers' after checkpointing layer ids in 
> provisioner.
> 
>
> Key: MESOS-8280
> URL: https://issues.apache.org/jira/browse/MESOS-8280
> Project: Mesos
>  Issue Type: Bug
>  Components: image-gc, provisioner
>Reporter: Gilbert Song
>Assignee: Zhitao Li
>Priority: Critical
>  Labels: containerizer, image-gc, provisioner
>
> {noformat}
> 1
> 22
> 33
> 44
> 1
> 22
> 33
> 44
> I1129 23:24:45.469543  6592 registry_puller.cpp:395] Extracting layer tar 
> ball 
> '/tmp/mesos/store/docker/staging/MVgVC7/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
>  to rootfs 
> '/tmp/mesos/store/docker/staging/MVgVC7/38135e3743e6dcb66bd1394b633053714333c7b7cf930bfeebfda660c06e/rootfs.overlay'
> I1129 23:24:45.473287  6592 registry_puller.cpp:395] Extracting layer tar 
> ball 
> '/tmp/mesos/store/docker/staging/MVgVC7/sha256:b56ae66c29370df48e7377c8f9baa744a3958058a766793f821dadcb144a4647
>  to rootfs 
> '/tmp/mesos/store/docker/staging/MVgVC7/b5815a31a59b66c909dbf6c670de78690d4b52649b8e283fc2bfd2594f61cca3/rootfs.overlay'
> I1129 23:24:45.582002  6594 registry_puller.cpp:395] Extracting layer tar 
> ball 
> '/tmp/mesos/store/docker/staging/6Zbc17/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
>  to rootfs 
> '/tmp/mesos/store/docker/staging/6Zbc17/e28617c6dd2169bfe2b10017dfaa04bd7183ff840c4f78ebe73fca2a89effeb6/rootfs.overlay'
> I1129 23:24:45.589404  6595 metadata_manager.cpp:167] Successfully cached 
> image 'alpine'
> I1129 23:24:45.590204  6594 registry_puller.cpp:395] Extracting layer tar 
> ball 
> '/tmp/mesos/store/docker/staging/6Zbc17/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
>  to rootfs 
> '/tmp/mesos/store/docker/staging/6Zbc17/be4ce2753831b8952a5b797cf45b2230e1befead6f5db0630bcb24a5f554255e/rootfs.overlay'
> I1129 23:24:45.595190  6594 registry_puller.cpp:395] Extracting layer tar 
> ball 
> '/tmp/mesos/store/docker/staging/6Zbc17/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
>  to rootfs 
> '/tmp/mesos/store/docker/staging/6Zbc17/53b5066c5a7dff5d6f6ef0c1945572d6578c083d550d2a3d575b4cdf7460306f/rootfs.overlay'
> I1129 23:24:45.599500  6594 registry_puller.cpp:395] Extracting layer tar 
> ball 
> '/tmp/mesos/store/docker/staging/6Zbc17/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
>  to rootfs 
> '/tmp/mesos/store/docker/staging/6Zbc17/a9eb172552348a9a49180694790b33a1097f546456d041b6e82e4d7716ddb721/rootfs.overlay'
> I1129 23:24:45.602047  6597 provisioner.cpp:506] Provisioning image rootfs 
> '/tmp/provisioner/containers/3bbc3fd1-0138-43a9-94ba-d017d813daac/containers/01de09c5-d8e9-412e-8825-a592d2c875e5/backends/overlay/rootfses/b5d48445-848d-4274-a4f8-e909351ebc35'
>  for container 
> 3bbc3fd1-0138-43a9-94ba-d017d813daac.01de09c5-d8e9-412e-8825-a592d2c875e5 
> using overlay backend
> I1129 23:24:45.602751  6594 registry_puller.cpp:395] Extracting layer tar 
> ball 
> '/tmp/mesos/store/docker/staging/6Zbc17/sha256:1db09adb5ddd7f1a07b6d585a7db747a51c7bd17418d47e91f901bdf420abd66
>  to rootfs 
> '/tmp/mesos/store/docker/staging/6Zbc17/120e218dd395ec314e7b6249f39d2853911b3d6def6ea164ae05722649f34b16/rootfs.overlay'
> I1129 23:24:45.603054  6596 overlay.cpp:168] Created symlink 
> '/tmp/provisioner/containers/3bbc3fd1-0138-43a9-94ba-d017d813daac/containers/01de09c5-d8e9-412e-8825-a592d2c875e5/backends/overlay/scratch/b5d48445-848d-4274-a4f8-e909351ebc35/links'
>  -> '/tmp/xAWQ8y'
> I1129 23:24:45.604398  6596 overlay.cpp:196] Provisioning image rootfs with 
> overlayfs: 
> 'lowerdir=/tmp/xAWQ8y/1:/tmp/xAWQ8y/0,upperdir=/tmp/provisioner/containers/3bbc3fd1-0138-43a9-94ba-d017d813daac/containers/01de09c5-d8e9-412e-8825-a592d2c875e5/backends/overlay/scratch/b5d48445-848d-4274-a4f8-e909351ebc35/upperdir,workdir=/tmp/provisioner/containers/3bbc3fd1-0138-43a9-94ba-d017d813daac/containers/01de09c5-d8e9-412e-8825-a592d2c875e5/backends/overlay/scratch/b5d48445-848d-4274-a4f8-e909351ebc35/workdir'
> I1129 23:24:45.607802  6594 registry_puller.cpp:395] Extracting layer tar 
> ball 
> '/tmp/mesos/store/docker/staging/6Zbc17/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
>  to rootfs 
> '/tmp/mesos/store/docker/staging/6Zbc17/42eed7f1bf2ac3f1610c5e616d2ab1ee9c7290234240388d6297bc0f32c34229/rootfs.overlay'
> I1129 23:24:45.612139  6594 registry_puller.cpp:395] Extracting layer tar 
> ball 
> '/tmp/mesos/store/docker/staging/6Zbc17/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955

[jira] [Commented] (MESOS-8070) Bundled GRPC build does not build on Debian 8

2017-12-10 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16285336#comment-16285336
 ] 

Zhitao Li commented on MESOS-8070:
--

[~gilbert], can we make sure this catches the 1.5 release? Thanks!

> Bundled GRPC build does not build on Debian 8
> -
>
> Key: MESOS-8070
> URL: https://issues.apache.org/jira/browse/MESOS-8070
> Project: Mesos
>  Issue Type: Bug
>Reporter: Zhitao Li
>Assignee: Chun-Hung Hsiao
> Fix For: 1.5.0
>
>
> Debian 8 includes an outdated version of libc-ares-dev, which prevents the 
> bundled GRPC from building.
> I believe [~chhsia0] already has a fix.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8323) Separate resource fetching timeout from executor_registration_timeout

2017-12-12 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8323:


 Summary: Separate resource fetching timeout from 
executor_registration_timeout
 Key: MESOS-8323
 URL: https://issues.apache.org/jira/browse/MESOS-8323
 Project: Mesos
  Issue Type: Improvement
Reporter: Zhitao Li


Containers can have widely varying image/resource sizes, so it is desirable to 
have a dedicated resource-fetching timeout (a duration) that is separate from the 
executor registration timeout.

[~bmahler], can we also agree that this should be customizable per task launch 
request (which can hopefully provide a better value based on its knowledge of the 
artifact sizes)?
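
For discussion, a hypothetical sketch of what such an agent flag could look like 
using stout's flags (the flag name and default value are assumptions, not an 
agreed-upon design):

{code:cpp}
#include <stout/duration.hpp>
#include <stout/flags.hpp>

class AgentFlags : public virtual flags::FlagsBase
{
public:
  AgentFlags()
  {
    add(&AgentFlags::fetcher_timeout,
        "fetcher_timeout",
        "Maximum time to wait for all URIs/images of a task to be fetched\n"
        "before failing the launch. A per-task override could later be\n"
        "plumbed through the launch request.",
        Minutes(15));
  }

  // Independent of --executor_registration_timeout.
  Duration fetcher_timeout;
};
{code}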



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8324) Add succeeded metric to container launch in Mesos agent

2017-12-12 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8324:


 Summary: Add succeeded metric to container launch in Mesos agent
 Key: MESOS-8324
 URL: https://issues.apache.org/jira/browse/MESOS-8324
 Project: Mesos
  Issue Type: Improvement
Reporter: Zhitao Li


The only agent metric related to containerizer stability is 
"slave/container_launch_errors", and it does not track standalone/nested 
containers.

I propose we add a container_launch_succeeded counter to track all container 
launches in the containerizer, and also make sure the `error` counter tracks 
standalone and nested containers.
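
For illustration, a minimal sketch of how such counters could be wired up with 
libprocess metrics (the metric names and placement are assumptions, not the final 
design):

{code:cpp}
#include <process/metrics/counter.hpp>
#include <process/metrics/metrics.hpp>

struct ContainerizerMetrics
{
  ContainerizerMetrics()
    : container_launch_succeeded(
          "containerizer/mesos/container_launch_succeeded"),
      container_launch_errors(
          "containerizer/mesos/container_launch_errors")
  {
    // Register the counters so they show up in the /metrics/snapshot endpoint.
    process::metrics::add(container_launch_succeeded);
    process::metrics::add(container_launch_errors);
  }

  ~ContainerizerMetrics()
  {
    process::metrics::remove(container_launch_succeeded);
    process::metrics::remove(container_launch_errors);
  }

  process::metrics::Counter container_launch_succeeded;
  process::metrics::Counter container_launch_errors;
};

// On every launch, including standalone and nested containers:
//   ++metrics.container_launch_succeeded;   // on success
//   ++metrics.container_launch_errors;      // on failure
{code}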



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8353) Duplicate task for same framework on multiple agents crashes out master after failover

2017-12-20 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8353:


 Summary: Duplicate task for same framework on multiple agents 
crashes out master after failover
 Key: MESOS-8353
 URL: https://issues.apache.org/jira/browse/MESOS-8353
 Project: Mesos
  Issue Type: Bug
Reporter: Zhitao Li


We have seen a Mesos master crash loop after a leader failover. After more 
investigation, it seems that the same task ID managed to be created on 
multiple Mesos agents in the cluster.

One possible logical sequence which can lead to such problem:

1. Task T1 was launched to master M1 on agent A1 for framework F;
2. Master M1 failed over to M2;
3. Before A1 reregistered with M2, the same T1 was launched onto agent A2: M2 
did not know about the previous T1 yet, so it accepted it and sent it to A2;
4. A1 reregistered: this probably crashed M2 (because the same task cannot be added 
twice);
5. When M3 tries to come up after M2, it crashes again because both A1 and A2 
tried to add a T1 to the framework.

(I only have logs to prove the last step right now)

This happened on 1.4.0 masters.

Although this is probably triggered by incorrect retry logic on the framework side, 
I wonder whether the Mesos master should add extra protection to prevent such an 
issue. One possible idea is to instruct one of the agents carrying a task with a 
duplicate ID to terminate the corresponding task, or to simply refuse to reregister 
such agents and instruct them to shut down.
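
For illustration, a simplified sketch of the kind of check the master could run 
while re-admitting an agent (the types are stand-ins for the real master data 
structures, not actual Mesos code):

{code:cpp}
#include <map>
#include <string>
#include <utility>
#include <vector>

using FrameworkID = std::string;
using TaskID = std::string;
using AgentID = std::string;

// (framework, task) -> agent currently known to the master.
using TaskIndex = std::map<std::pair<FrameworkID, TaskID>, AgentID>;

// Returns true if a reregistering agent reports a task that the master already
// tracks on a *different* agent for the same framework; the master could then
// refuse the reregistration (or shut down the duplicate task) instead of
// crashing on a CHECK.
bool hasConflictingTasks(
    const AgentID& reregisteringAgent,
    const std::vector<std::pair<FrameworkID, TaskID>>& reportedTasks,
    const TaskIndex& knownTasks)
{
  for (const auto& key : reportedTasks) {
    auto it = knownTasks.find(key);
    if (it != knownTasks.end() && it->second != reregisteringAgent) {
      return true;
    }
  }
  return false;
}
{code}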



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8358) Create agent endpoints for pruning images

2017-12-22 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8358:


 Summary: Create agent endpoints for pruning images
 Key: MESOS-8358
 URL: https://issues.apache.org/jira/browse/MESOS-8358
 Project: Mesos
  Issue Type: Bug
Reporter: Zhitao Li
Assignee: Zhitao Li


This is a follow-up to MESOS-4945: we agreed that we should create an HTTP 
endpoint on the agent to manually trigger image gc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8358) Create agent endpoints for pruning images

2017-12-22 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-8358:
-
Issue Type: Improvement  (was: Bug)

> Create agent endpoints for pruning images
> -
>
> Key: MESOS-8358
> URL: https://issues.apache.org/jira/browse/MESOS-8358
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> This is a follow-up to MESOS-4945: we agreed that we should create an HTTP 
> endpoint on the agent to manually trigger image gc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8365) Create AuthN support for prune images API

2017-12-28 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-8365:
-
Target Version/s: 1.5.0

> Create AuthN support for prune images API
> -
>
> Key: MESOS-8365
> URL: https://issues.apache.org/jira/browse/MESOS-8365
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> We want to make sure there is a way to configure AuthZ for the new API added 
> in MESOS-8360.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8365) Create AuthN support for prune images API

2017-12-28 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8365:


 Summary: Create AuthN support for prune images API
 Key: MESOS-8365
 URL: https://issues.apache.org/jira/browse/MESOS-8365
 Project: Mesos
  Issue Type: Improvement
Reporter: Zhitao Li
Assignee: Zhitao Li


We want to make sure there is a way to configure AuthZ for the new API added in 
MESOS-8360.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-4945) Garbage collect unused docker layers in the store.

2018-01-08 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316816#comment-16316816
 ] 

Zhitao Li commented on MESOS-4945:
--

That one is not necessarily part of this epic. I'll move it out.

> Garbage collect unused docker layers in the store.
> --
>
> Key: MESOS-4945
> URL: https://issues.apache.org/jira/browse/MESOS-4945
> Project: Mesos
>  Issue Type: Epic
>Reporter: Jie Yu
>Assignee: Zhitao Li
>  Labels: Mesosphere
> Fix For: 1.5.0
>
>
> Right now, we don't have any garbage collection in place for docker layers. 
> It's not straightforward to implement because we don't know what container is 
> currently using the layer. We probably need a way to track the current usage 
> of layers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-6893) Track total docker image layer size in store

2018-01-08 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-6893:
-
   Priority: Minor  (was: Major)
Description: We want to give cluster operators some insight into the total size 
of docker image layers in the store so we can use it for monitoring purposes.
Component/s: containerization
 Issue Type: Improvement  (was: Task)
Summary: Track total docker image layer size in store  (was: Track 
docker layer size and access time)

> Track total docker image layer size in store
> 
>
> Key: MESOS-6893
> URL: https://issues.apache.org/jira/browse/MESOS-6893
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Minor
>
> We want to give cluster operators some insight into the total size of docker 
> image layers in the store so we can use it for monitoring purposes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-6134) Port CFS quota support to Docker Containerizer using command executor.

2016-10-10 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15562761#comment-15562761
 ] 

Zhitao Li commented on MESOS-6134:
--

[~alexr], I'm fine with releasing this in 1.2, but the patch has been ready 
for a while and is pretty uncontroversial. We have been running the 
cherry-pick for a while without any issues.

[~jieyu], do you think we can commit this in 1.1 to close it out?

> Port CFS quota support to Docker Containerizer using command executor.
> --
>
> Key: MESOS-6134
> URL: https://issues.apache.org/jira/browse/MESOS-6134
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> MESOS-2154 only partially fixed the CFS quota support in Docker 
> Containerizer: that fix only works for custom executor.
> This tracks the fix for command executor so we can declare this is complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6177) Return unregistered agents recovered from registrar in `GetAgents` and/or `/state.json`

2016-10-13 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572997#comment-15572997
 ] 

Zhitao Li commented on MESOS-6177:
--

[~anandmazumdar], I'm strongly inclined to return the full {{AgentInfo}} 
instead of only the {{AgentID}} for agents in the {{recovered}} state.

My primary intention is to get hold of the {{pid}}, so the operator/subscriber 
can know the ip:port the agent is listening at. If we only return the {{AgentID}}, 
the operator can do little to validate the state of the agent beyond waiting for 
{{--agent_reregistration_timeout}} to pass.

This is also pretty easy to implement IIUIC: we can simply change 
{{slaves.recovered}} from a {{hashset<SlaveID>}} to a {{hashmap<SlaveID, SlaveInfo>}}; 
the {{SlaveInfo}} is already available after the Registrar recovers it.
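
For illustration, a simplified sketch of the proposed bookkeeping change (the 
types below are stand-ins; the real code would use stout's hashset/hashmap and 
mesos::SlaveInfo in src/master/master.hpp):

{code:cpp}
#include <string>
#include <unordered_map>
#include <unordered_set>

using SlaveID = std::string;

struct SlaveInfo
{
  std::string hostname;
  // ... other fields recovered from the registry.
};

struct Slaves
{
  // Before: only the IDs of agents recovered from the registrar but not yet
  // reregistered.
  std::unordered_set<SlaveID> recoveredIds;

  // After: keep the full SlaveInfo (already available from the registrar), so
  // GetAgents / /state can expose e.g. the hostname of an agent that has not
  // reregistered yet.
  std::unordered_map<SlaveID, SlaveInfo> recovered;
};
{code}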

> Return unregistered agents recovered from registrar in `GetAgents` and/or 
> `/state.json`
> ---
>
> Key: MESOS-6177
> URL: https://issues.apache.org/jira/browse/MESOS-6177
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> Use case:
> This can be used for any software which talks to Mesos master to better 
> understand state of an unregistered agent after a master failover.
> If this information is available, the use case in MESOS-6174 can be handled 
> with a simpler decision of whether the corresponding agent is removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-6177) Return unregistered agents recovered from registrar in `GetAgents` and/or `/state.json`

2016-10-13 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572997#comment-15572997
 ] 

Zhitao Li edited comment on MESOS-6177 at 10/13/16 11:07 PM:
-

[~anandmazumdar], after some more thoughts, I'm inclined to return the full 
{{AgentInfo}} instead of only {{AgentID}} for agents in {{recovered}} state.

My primary intention is to have a hold of {{pid}}, so the operator/subscriber 
can know the ip:port the agent is listening at. If we only return {{AgentID}}, 
the operator can do little additional babysitting steps to validate the state 
of the agent, except for waiting for {{--agent_reregistration_timeout}} to pass.

This is also pretty easy to implement IIUIC: we can simply change the 
{{slaves.recovered}} from {{hashset}} to {{hashmap}}. The {{SlaveInfo}} is already available after Registrar recovers 
it.


was (Author: zhitao):
[~anandmazumdar], I'm strongly inclined to return the full {{AgentInfo}} 
instead of only {{AgentID}} for agents in {{recovered}} state.

My primary intention is to have a hold of {{pid}}, so the operator/subscriber 
can know the ip:port the agent is listening at. If we only return {{AgentID}}, 
the operator can do little additional babysitting steps to validate the state 
of the agent, except for waiting for {{--agent_reregistration_timeout}} to pass.

This is also pretty easy to implement IIUIC: we can simply change the 
{{slaves.recovered}} from {{hashset}} to {{hashmap}}. The {{SlaveInfo}} is already available after Registrar recovers 
it.

> Return unregistered agents recovered from registrar in `GetAgents` and/or 
> `/state.json`
> ---
>
> Key: MESOS-6177
> URL: https://issues.apache.org/jira/browse/MESOS-6177
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> Use case:
> This can be used for any software which talks to Mesos master to better 
> understand state of an unregistered agent after a master failover.
> If this information is available, the use case in MESOS-6174 can be handled 
> with a simpler decision of whether the corresponding agent is removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-6177) Return unregistered agents recovered from registrar in `GetAgents` and/or `/state.json`

2016-10-13 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572997#comment-15572997
 ] 

Zhitao Li edited comment on MESOS-6177 at 10/14/16 1:24 AM:



(edited)

[~anandmazumdar], after some more thoughts, I'm inclined to return the full 
{{AgentInfo}} instead of only {{AgentID}} for agents in {{recovered}} state.

This has the benefit to help operators to know the hostname of the agent id 
which is not recovered yet without calling registry again.

-My primary intention is to have a hold of {{pid}}, so the operator/subscriber 
can know the ip:port the agent is listening at. If we only return {{AgentID}}, 
the operator can do little additional babysitting steps to validate the state 
of the agent, except for waiting for {{--agent_reregistration_timeout}} to 
pass.-

-This is also pretty easy to implement IIUIC: we can simply change the 
{{slaves.recovered}} from {{hashset}} to {{hashmap}}. The {{SlaveInfo}} is already available after Registrar recovers 
it.-



was (Author: zhitao):
[~anandmazumdar], after some more thoughts, I'm inclined to return the full 
{{AgentInfo}} instead of only {{AgentID}} for agents in {{recovered}} state.

My primary intention is to have a hold of {{pid}}, so the operator/subscriber 
can know the ip:port the agent is listening at. If we only return {{AgentID}}, 
the operator can do little additional babysitting steps to validate the state 
of the agent, except for waiting for {{--agent_reregistration_timeout}} to pass.

This is also pretty easy to implement IIUIC: we can simply change the 
{{slaves.recovered}} from {{hashset}} to {{hashmap}}. The {{SlaveInfo}} is already available after Registrar recovers 
it.

> Return unregistered agents recovered from registrar in `GetAgents` and/or 
> `/state.json`
> ---
>
> Key: MESOS-6177
> URL: https://issues.apache.org/jira/browse/MESOS-6177
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> Use case:
> This can be used for any software which talks to Mesos master to better 
> understand state of an unregistered agent after a master failover.
> If this information is available, the use case in MESOS-6174 can be handled 
> with a simpler decision of whether the corresponding agent is removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-6177) Return unregistered agents recovered from registrar in `GetAgents` and/or `/state.json`

2016-10-13 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572997#comment-15572997
 ] 

Zhitao Li edited comment on MESOS-6177 at 10/14/16 1:25 AM:


(edited after recalling that pid is not in SlaveInfo. We should think about 
adding {{Address}} to {{SlaveInfo}} if possible but that has to be a different 
ticket)

[~anandmazumdar], after some more thoughts, I'm inclined to return the full 
{{AgentInfo}} instead of only {{AgentID}} for agents in {{recovered}} state.

This has the benefit to help operators to know the hostname of the agent id 
which is not recovered yet without calling registry again.

-My primary intention is to have a hold of {{pid}}, so the operator/subscriber 
can know the ip:port the agent is listening at. If we only return {{AgentID}}, 
the operator can do little additional babysitting steps to validate the state 
of the agent, except for waiting for {{--agent_reregistration_timeout}} to 
pass.-

-This is also pretty easy to implement IIUIC: we can simply change the 
{{slaves.recovered}} from {{hashset}} to {{hashmap}}. The {{SlaveInfo}} is already available after Registrar recovers 
it.-



was (Author: zhitao):

(edited)

[~anandmazumdar], after some more thoughts, I'm inclined to return the full 
{{AgentInfo}} instead of only {{AgentID}} for agents in {{recovered}} state.

This has the benefit to help operators to know the hostname of the agent id 
which is not recovered yet without calling registry again.

-My primary intention is to have a hold of {{pid}}, so the operator/subscriber 
can know the ip:port the agent is listening at. If we only return {{AgentID}}, 
the operator can do little additional babysitting steps to validate the state 
of the agent, except for waiting for {{--agent_reregistration_timeout}} to 
pass.-

-This is also pretty easy to implement IIUIC: we can simply change the 
{{slaves.recovered}} from {{hashset}} to {{hashmap}}. The {{SlaveInfo}} is already available after Registrar recovers 
it.-


> Return unregistered agents recovered from registrar in `GetAgents` and/or 
> `/state.json`
> ---
>
> Key: MESOS-6177
> URL: https://issues.apache.org/jira/browse/MESOS-6177
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> Use case:
> This can be used for any software which talks to Mesos master to better 
> understand state of an unregistered agent after a master failover.
> If this information is available, the use case in MESOS-6174 can be handled 
> with a simpler decision of whether the corresponding agent is removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-4945) Garbage collect unused docker layers in the store.

2016-10-18 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li reassigned MESOS-4945:


Assignee: Zhitao Li

> Garbage collect unused docker layers in the store.
> --
>
> Key: MESOS-4945
> URL: https://issues.apache.org/jira/browse/MESOS-4945
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Jie Yu
>Assignee: Zhitao Li
>
> Right now, we don't have any garbage collection in place for docker layers. 
> It's not straightforward to implement because we don't know what container is 
> currently using the layer. We probably need a way to track the current usage 
> of layers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4945) Garbage collect unused docker layers in the store.

2016-10-18 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586813#comment-15586813
 ] 

Zhitao Li commented on MESOS-4945:
--

Current plan:

- Add a "cleanup" method to the store interface, which takes a {{vector<Image>}} 
of "images in use" (a rough interface sketch follows below);
- The store can choose its own implementation of what it wants to clean up. Deleted 
images will be returned in a {{Future<vector<Image>>}};
- It's the job of the Containerizer/Provisioner to actively prepare the list of 
"images in use";
- Initially this can simply be done by traversing all active containers, since the 
provisioner already has all the information in its memory;
- The initial implementation will add a new flag indicating an upper limit on the 
size of the docker store directory, and docker::store will delete images until the 
total drops below that limit;
- The invocation of store::cleanup can happen in a background timer, upon 
provisioner::destroy, or before the pull (I have no real preference, but calling it 
before the pull seems safest if we use a space-based policy);
- The initial implementation in the store will traverse all images in the store;
- Further optimizations include implementing reference counting and size 
accounting of all images in the store, and checkpointing them. We might also need 
some kind of LRU implementation here.
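
For discussion, a rough sketch of what the proposed "cleanup" addition could look 
like (the names and signature are assumptions, not the final API):

{code:cpp}
#include <string>
#include <vector>

#include <process/future.hpp>

struct Image
{
  std::string name; // stand-in for mesos::Image
};

class Store
{
public:
  virtual ~Store() {}

  // Existing methods (get, recover, ...) elided.

  // Remove cached images/layers that are not in `imagesInUse`, subject to the
  // store's own policy (e.g. a size limit on the docker store directory).
  // Returns the images that were actually deleted.
  virtual process::Future<std::vector<Image>> cleanup(
      const std::vector<Image>& imagesInUse) = 0;
};
{code}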

> Garbage collect unused docker layers in the store.
> --
>
> Key: MESOS-4945
> URL: https://issues.apache.org/jira/browse/MESOS-4945
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Jie Yu
>Assignee: Zhitao Li
>
> Right now, we don't have any garbage collection in place for docker layers. 
> It's not straightforward to implement because we don't know what container is 
> currently using the layer. We probably need a way to track the current usage 
> of layers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6415) Create an unit test for OOM in Mesos containerizer's mem isolator

2016-10-18 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-6415:


 Summary: Create an unit test for OOM in Mesos containerizer's mem 
isolator
 Key: MESOS-6415
 URL: https://issues.apache.org/jira/browse/MESOS-6415
 Project: Mesos
  Issue Type: Improvement
  Components: testing
Reporter: Zhitao Li
Assignee: Zhitao Li
Priority: Minor


It seems like we don't have any integration test exercising the case of 
exceeding the container memory limit.

We could add one to cgroups_isolator_tests.cpp.

Good starting task for anyone interested in this area, including myself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6415) Create an unit test for OOM in Mesos containerizer's mem isolator

2016-10-18 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587117#comment-15587117
 ] 

Zhitao Li commented on MESOS-6415:
--

[~jieyu] and I chatted on the containerizer slack, and I exactly want to pursue 
this as a separate test. 

Slack history:
```

It seems like we don't have any integration test practicing the case of 
exceeding container memory limit. @jieyu @gilbert ?

Jie Yu [4:43 PM]  
we do have a balloon framework

Zhitao Li [4:44 PM]  
Is it exercised through a test?

Jie Yu [4:44 PM]  
yeah

Gilbert Song [4:44 PM]  
yes,

[4:44]  
through a script in a unit test

Jie Yu [4:44 PM]  
in retrospect, we can simply use a command task

[4:45]  
at the time balloon framework was written, command task does not exist yet

Zhitao Li [4:45 PM]  
I'd volunteer me or someone from our team to write a smaller test, if you want 
to shepherd (edited)

Jie Yu [4:45 PM]  
yup, i’d be happy to shepherd

[4:45]  
you should add one to cgroups_isolator_tests.cpp

Zhitao Li [4:46 PM]  
Will file an issue and claim it under my umbrella for now. Thanks

Jie Yu [4:46 PM]  
oh

[4:46]  
hold on

[4:46]  
we do have MemoryPressureMesosTest

[4:48]  
but I guess we don’t have a oom test

[4:48]  
memory pressure is mainly for the stats

[4:49]  
yeah, @zhitao, we should add a OOM test
```

> Create an unit test for OOM in Mesos containerizer's mem isolator
> -
>
> Key: MESOS-6415
> URL: https://issues.apache.org/jira/browse/MESOS-6415
> Project: Mesos
>  Issue Type: Improvement
>  Components: testing
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Minor
>
> It seems like we don't have any integration test practicing the case of 
> exceeding container memory limit.
> We could add one to cgroups_isolator_tests.cpp.
> Good starting task for anyone interested in this area, including myself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-6415) Create an unit test for OOM in Mesos containerizer's mem isolator

2016-10-18 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587117#comment-15587117
 ] 

Zhitao Li edited comment on MESOS-6415 at 10/19/16 12:10 AM:
-

[~jieyu] and I chatted on the containerizer slack, and I exactly want to pursue 
this as a separate test. 

Slack history:
{quote}
It seems like we don't have any integration test practicing the case of 
exceeding container memory limit. @jieyu @gilbert ?

Jie Yu [4:43 PM]  
we do have a balloon framework

Zhitao Li [4:44 PM]  
Is it exercised through a test?

Jie Yu [4:44 PM]  
yeah

Gilbert Song [4:44 PM]  
yes,

[4:44]  
through a script in a unit test

Jie Yu [4:44 PM]  
in retrospect, we can simply use a command task

[4:45]  
at the time balloon framework was written, command task does not exist yet

Zhitao Li [4:45 PM]  
I'd volunteer me or someone from our team to write a smaller test, if you want 
to shepherd (edited)

Jie Yu [4:45 PM]  
yup, i’d be happy to shepherd

[4:45]  
you should add one to cgroups_isolator_tests.cpp

Zhitao Li [4:46 PM]  
Will file an issue and claim it under my umbrella for now. Thanks

Jie Yu [4:46 PM]  
oh

[4:46]  
hold on

[4:46]  
we do have MemoryPressureMesosTest

[4:48]  
but I guess we don’t have a oom test

[4:48]  
memory pressure is mainly for the stats

[4:49]  
yeah, @zhitao, we should add a OOM test
{quote}


was (Author: zhitao):
[~jieyu] and I chatted on the containerizer slack, and I exactly want to pursue 
this as a separate test. 

Slack history:
```

It seems like we don't have any integration test practicing the case of 
exceeding container memory limit. @jieyu @gilbert ?

Jie Yu [4:43 PM]  
we do have a balloon framework

Zhitao Li [4:44 PM]  
Is it exercised through a test?

Jie Yu [4:44 PM]  
yeah

Gilbert Song [4:44 PM]  
yes,

[4:44]  
through a script in a unit test

Jie Yu [4:44 PM]  
in retrospect, we can simply use a command task

[4:45]  
at the time balloon framework was written, command task does not exist yet

Zhitao Li [4:45 PM]  
I'd volunteer me or someone from our team to write a smaller test, if you want 
to shepherd (edited)

Jie Yu [4:45 PM]  
yup, i’d be happy to shepherd

[4:45]  
you should add one to cgroups_isolator_tests.cpp

Zhitao Li [4:46 PM]  
Will file an issue and claim it under my umbrella for now. Thanks

Jie Yu [4:46 PM]  
oh

[4:46]  
hold on

[4:46]  
we do have MemoryPressureMesosTest

[4:48]  
but I guess we don’t have a oom test

[4:48]  
memory pressure is mainly for the stats

[4:49]  
yeah, @zhitao, we should add a OOM test
```

> Create an unit test for OOM in Mesos containerizer's mem isolator
> -
>
> Key: MESOS-6415
> URL: https://issues.apache.org/jira/browse/MESOS-6415
> Project: Mesos
>  Issue Type: Improvement
>  Components: testing
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Minor
>
> It seems like we don't have any integration test practicing the case of 
> exceeding container memory limit.
> We could add one to cgroups_isolator_tests.cpp.
> Good starting task for anyone interested in this area, including myself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4945) Garbage collect unused docker layers in the store.

2016-10-19 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586813#comment-15586813
 ] 

Zhitao Li edited comment on MESOS-4945 at 10/19/16 8:58 AM:


Revised plan in rough steps:
* For each image, checkpoint a) the container ids using it, b) the time the last 
container using it was destroyed, and c) the size of each layer;
** TODO: how to deal with migration? The idea is passing more info down the 
recover() chain of containerizer -> provisioner -> store;
* Change the store interface (a rough sketch follows below):
** "get(Image)" becomes "get(Image, ContainerID)";
*** the added ContainerID field can be used to implement ref counting 
and further bookkeeping (i.e. get local image information);
** add a "remove(Image, ContainerID)" virtual function;
*** this is optional: a store which does not do ref counting can skip 
implementing it.
* Make sure provisioner::destroy() calls store::remove(Image, ContainerID);
* Add a command line flag for the docker store capacity limit;
* In (docker) store::get(Image, ContainerID), after a pull is done, calculate the 
total layer sizes; if above the store capacity, remove images with empty container 
ids (i.e. not in use), sorted by the time they were last used, until the total size 
drops below capacity. Any layer no longer referenced is also removed.

Open questions:

1) In this design, we keep an explicit reference count between 
{{Container}} and {{Image}} in the store. However, this information could be 
constructed on the fly from all containers in the {{Containerizer}} class. Do we 
consider this "double accounting" problematic, or error-prone?
2) Is calling the new {{remove(Image, ContainerID)}} from 
{{Provisioner::destroy()}} sufficient to make sure all bookkeeping is 
properly done?


was (Author: zhitao):
Current plan:

- Add a "cleanup" method to store interface, which takes a {{vector}} 
for "images in use";
- store can choose its own implementation of what it wants to cleanup. Deleted 
images will be returned in a {{Future>}};
- it's the job of Containerizer/Provisioner to actively prepare the list of 
"images in use"
- initially this can simply be done by traversing all active containers, if 
provisioner already has all information in its memory;
- Initial implementation will add a new flag indicating upper limit of size for 
docker store directory, and docker::store will delete images until it drops 
below there;
- The invocation to store::cleanup can happen either in a background timer, 
upon provisioner::destroy, or before the pull? (I have no real preference, but 
calling it before pull seems safest if we use space based policy?);
- Initial implementation on store will traverse all images in the store;
- Further optimization including implementing a reference counting and size 
counting of all images in store, and checkpointing them. We might also need 
some kind of LRU implementation here.

> Garbage collect unused docker layers in the store.
> --
>
> Key: MESOS-4945
> URL: https://issues.apache.org/jira/browse/MESOS-4945
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Jie Yu
>Assignee: Zhitao Li
>
> Right now, we don't have any garbage collection in place for docker layers. 
> It's not straightforward to implement because we don't know what container is 
> currently using the layer. We probably need a way to track the current usage 
> of layers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4945) Garbage collect unused docker layers in the store.

2016-10-19 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586813#comment-15586813
 ] 

Zhitao Li edited comment on MESOS-4945 at 10/19/16 9:03 AM:


Revised plan in rough steps:
* For each image, checkpoint a) container ids, b) the time at which the last 
container using it was destroyed, and c) the size of each layer;
** TODO: how to deal with migration? The idea is to pass more info through the 
recover() chain of containerizer -> provisioner -> store;
* Change the store interface:
** "get(Image)" to "get(Image, ContainerID)": the added ContainerID field can 
be used to implement ref counting and further bookkeeping (e.g., getting local 
image information);
** add a "remove(Image, ContainerID)" virtual function: this is optional, in 
that a store which does not do ref counting can skip implementing it.
* Make sure provisioner::destroy() calls store::remove(Image, ContainerID);
* Add a command line flag for the docker store capacity limit (in bytes);
* In (docker) store::get(Image, ContainerID), after a pull is done, calculate 
the total layer size; if it exceeds the store capacity, remove unused images 
(determined by empty container ids), sorted by time of last use. Any layer not 
shared by the remaining images is also removed, until the total size drops 
below capacity.

Open questions: 

1) In this design, we keep an explicit reference count between 
{{Container}} and {{Image}} in the store. However, this information could be 
constructed on-the-fly from all containers in the {{Containerizer}} class. Do we 
consider this "double accounting" problematic, or error-prone?
2) Is calling the new {{remove(Image, ContainerID)}} from 
{{Provisioner::destroy()}} sufficient to make sure all bookkeeping is 
properly done?


was (Author: zhitao):
Revised plan in rough steps:
* For each image, checkpoint a) container ids, b) time of last container using 
it being destroyed, and c) size of each layer;
** TODO: how do deal with migration? idea is passing in more info in 
recover() chain of containerizer -> provisioner -> store;
* Change store interface:
** "get(Image)" to "get(Image, ContainerID)",
***The containerID field added can be used to implement ref counting 
and further book keeping (i.e. get local images information);
**add "remove(Image, ContainerID)" virtual function;
  *** this is optional: store which does not do ref counting can skip 
implementing.
*  Make sure provisioner::destroy() call store::remove(Image, ContainerID);
* Add command line flag for docker store capacity limit;
* In (docker) store::get(Image, ContainerID), after a pull is done, calculate 
total layer sizes, if above store capacity, remove images with empty container 
ids (aka not used), sorted by last time not used. Any layer not used is also 
removed, until total size is dropped below capacity.

Open question: 

1) In this design, we have one explicit reference counting between 
{{Container}} and {{Image}} in store. However, this information could be 
constructed on-the-fly with all containers in {{Containerizer}} class. Do we 
consider this "double accounting" problematic, or error-prone?
2) Is calling new {{remove(Image, ContainerID)}} from 
{{Provisioner::destroy()}} sufficient to make sure all book keepings are 
properly done?

> Garbage collect unused docker layers in the store.
> --
>
> Key: MESOS-4945
> URL: https://issues.apache.org/jira/browse/MESOS-4945
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Jie Yu
>Assignee: Zhitao Li
>
> Right now, we don't have any garbage collection in place for docker layers. 
> It's not straightforward to implement because we don't know what container is 
> currently using the layer. We probably need a way to track the current usage 
> of layers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4945) Garbage collect unused docker layers in the store.

2016-10-19 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586813#comment-15586813
 ] 

Zhitao Li edited comment on MESOS-4945 at 10/19/16 4:39 PM:


Revised plan in rough steps:
* For each image, checkpoint a) container ids, b) the time at which the last 
container using it was destroyed, and c) the size of each layer;
** TODO: how to deal with migration? The idea is to pass more info through the 
recover() chain of containerizer -> provisioner -> store;
* Change the store interface (a rough interface sketch follows below):
** "get(Image)" to "get(Image, ContainerID)": the added ContainerID field can 
be used to implement ref counting and further bookkeeping (e.g., getting local 
image information);
** add a "remove(Image, ContainerID)" virtual function: this is optional, in 
that a store which does not do ref counting can have an empty implementation.
* Make sure provisioner::destroy() calls store::remove(Image, ContainerID);
* Add a command line flag for the docker store capacity limit (in bytes);
* In (docker) store::get(Image, ContainerID), after a pull is done, calculate 
the total layer size; if it exceeds the store capacity, remove unused images 
(determined by empty container ids), sorted by time of last use. Any layer not 
shared by the remaining images is also removed, until the total size drops 
below capacity.

Open questions: 

1) In this design, we keep an explicit reference count between 
{{Container}} and {{Image}} in the store. However, this information could be 
constructed on-the-fly from all containers in the {{Containerizer}} class. Do we 
consider this "double accounting" problematic, or error-prone?
2) Is calling the new {{remove(Image, ContainerID)}} from 
{{Provisioner::destroy()}} sufficient to make sure all bookkeeping is 
properly done?
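
To make the interface change concrete, here is a minimal sketch of what the 
revised store interface could look like. It assumes the existing {{Image}}, 
{{ContainerID}} and {{ImageInfo}} types; the exact signatures and the default 
no-op {{remove}} are assumptions, not the final design:

{code}
#include <process/future.hpp>

#include <stout/nothing.hpp>

// Hypothetical sketch of the revised store interface; `Image`,
// `ContainerID` and `ImageInfo` come from the existing Mesos sources.
class Store
{
public:
  virtual ~Store() {}

  virtual process::Future<Nothing> recover() = 0;

  // The ContainerID lets the store keep a per-image reference count and
  // do further bookkeeping (e.g., report which local images are in use).
  virtual process::Future<ImageInfo> get(
      const Image& image,
      const ContainerID& containerId) = 0;

  // Optional: a store that does not do ref counting can keep this
  // default no-op implementation.
  virtual process::Future<Nothing> remove(
      const Image& image,
      const ContainerID& containerId)
  {
    return Nothing();
  }
};
{code}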


was (Author: zhitao):
Revised plan in rough steps:
* For each image, checkpoint a) container ids, b) time of last container using 
it being destroyed, and c) size of each layer;
** TODO: how do deal with migration? idea is passing in more info in 
recover() chain of containerizer -> provisioner -> store;
* Change store interface:
** "get(Image)" to "get(Image, ContainerID)": The containerID field added 
can be used to implement ref counting and further book keeping (i.e. get local 
images information);
** add "remove(Image, ContainerID)" virtual function: this is optional in 
that store which does not do ref counting can skip implementing.
*  Make sure provisioner::destroy() call store::remove(Image, ContainerID);
* Add command line flag for docker store capacity limit (in bytes);
* In (docker) store::get(Image, ContainerID), after a pull is done, calculate 
total layer sizes, if above store capacity, remove unused images (determined by 
empty container ids), sorted by last time not used. Any layer not shared by 
leftover images is also removed, until total size is dropped below capacity.

Open question: 

1) In this design, we have one explicit reference counting between 
{{Container}} and {{Image}} in store. However, this information could be 
constructed on-the-fly with all containers in {{Containerizer}} class. Do we 
consider this "double accounting" problematic, or error-prone?
2) Is calling new {{remove(Image, ContainerID)}} from 
{{Provisioner::destroy()}} sufficient to make sure all book keepings are 
properly done?

> Garbage collect unused docker layers in the store.
> --
>
> Key: MESOS-4945
> URL: https://issues.apache.org/jira/browse/MESOS-4945
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Jie Yu
>Assignee: Zhitao Li
>
> Right now, we don't have any garbage collection in place for docker layers. 
> It's not straightforward to implement because we don't know what container is 
> currently using the layer. We probably need a way to track the current usage 
> of layers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6429) Create metrics for docker store

2016-10-20 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-6429:


 Summary: Create metrics for docker store
 Key: MESOS-6429
 URL: https://issues.apache.org/jira/browse/MESOS-6429
 Project: Mesos
  Issue Type: Improvement
Reporter: Zhitao Li


Ideas for metrics we have right now (in order of importance):
- size of the store (gauge)
- amount of data pulled (counter)
- number of layers cached (gauge)
- number of pulls (counter; the rate can be derived externally)

Suggestions on what other metrics we should add are welcome. A rough sketch of 
the counter wiring follows.
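
A minimal sketch of how the two counters could be wired up with libprocess 
metrics; the metric keys and struct layout are placeholders, not final names:

{code}
#include <process/metrics/counter.hpp>
#include <process/metrics/metrics.hpp>

// Hypothetical metrics struct for the docker store; only the counters are
// shown. The "size of store" and "layers cached" gauges would be
// process::metrics::Gauge instances backed by deferred callbacks into the
// store actor.
struct Metrics
{
  Metrics()
    : image_pulls("containerizer/docker_store/image_pulls"),
      data_pulled_bytes("containerizer/docker_store/data_pulled_bytes")
  {
    process::metrics::add(image_pulls);
    process::metrics::add(data_pulled_bytes);
  }

  ~Metrics()
  {
    process::metrics::remove(image_pulls);
    process::metrics::remove(data_pulled_bytes);
  }

  process::metrics::Counter image_pulls;
  process::metrics::Counter data_pulled_bytes;
};

// After a successful pull:
//   ++metrics.image_pulls;
//   metrics.data_pulled_bytes += layerBytes;
{code}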



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4945) Garbage collect unused docker layers in the store.

2016-10-20 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-4945:
-
Shepherd: Jie Yu

> Garbage collect unused docker layers in the store.
> --
>
> Key: MESOS-4945
> URL: https://issues.apache.org/jira/browse/MESOS-4945
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Jie Yu
>Assignee: Zhitao Li
>
> Right now, we don't have any garbage collection in place for docker layers. 
> It's not straightforward to implement because we don't know what container is 
> currently using the layer. We probably need a way to track the current usage 
> of layers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6429) Create metrics for docker store

2016-10-20 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-6429:
-
Shepherd: Jie Yu
Assignee: Zhitao Li

> Create metrics for docker store
> ---
>
> Key: MESOS-6429
> URL: https://issues.apache.org/jira/browse/MESOS-6429
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> Ideas of metrics we have right now (in order of importance)
> - size of store (gauge)
> - amount of data pulled (counter)
> - number of layers cached (gauge)
> - number of pulls (counter: rate can be derived externally);
> Suggestion on what metrics we should still add is welcomed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6434) Publicize the test infrastructure for modules

2016-10-20 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-6434:


 Summary: Publicize the test infrastructure for modules
 Key: MESOS-6434
 URL: https://issues.apache.org/jira/browse/MESOS-6434
 Project: Mesos
  Issue Type: Task
  Components: testing
Reporter: Zhitao Li


As we discussed in today's meeting, the goal is to allow module authors to use 
Mesos's internal testing infrastructure to test their own modules and replace 
hacks like 
https://github.com/dcos/dcos-mesos-modules/blob/bb6f6b22138ae38c9c8305e571deca2e4df7f3b3/configure.ac#L342-L359

Some action items I recall:
- clean up existing headers and make them suitable for installation;
- determine whether we will allow unversioned protobuf or only v1 protobuf;
- create a library like libmesos_tests that module authors can link against.

[~kaysoky] [~jvanremoortere], please help fill in more details and triage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6451) Add timer and percentile for docker pull latency distribution.

2016-10-21 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-6451:


 Summary: Add timer and percentile for docker pull latency 
distribution.
 Key: MESOS-6451
 URL: https://issues.apache.org/jira/browse/MESOS-6451
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Zhitao Li
Assignee: Zhitao Li


The proposal here is to add a timer to both the Mesos containerizer and the 
Docker containerizer to monitor the latency distribution of image pulls.

This helps operators who run either containerizer in production, and during a 
migration phase it helps to understand any performance variation.

I plan to use a one-hour look-back window for this timer, unless there are 
other concerns. A rough sketch of the metric follows.
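
A rough sketch of the metric wiring using the libprocess {{Timer}} metric; the 
metric key and the commented-out usage are placeholders, not the final change:

{code}
#include <process/future.hpp>
#include <process/metrics/metrics.hpp>
#include <process/metrics/timer.hpp>

#include <stout/duration.hpp>

// Hypothetical metric: a timer that keeps a one-hour sliding window of
// pull latencies, from which percentiles can be derived.
process::metrics::Timer<Milliseconds> image_pull_ms(
    "containerizer/docker/image_pull_ms", Hours(1));

// Registration, e.g. in the containerizer's metrics struct constructor:
//   process::metrics::add(image_pull_ms);
//
// Timing an asynchronous pull: the timer records the elapsed time when the
// returned future completes.
//   Future<Docker::Image> image =
//     image_pull_ms.time(docker->pull(directory, name));
{code}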



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6451) Add timer and percentile for docker pull latency distribution.

2016-10-21 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15596439#comment-15596439
 ] 

Zhitao Li commented on MESOS-6451:
--

https://reviews.apache.org/r/53105/ for DockerContainerizer.

> Add timer and percentile for docker pull latency distribution.
> --
>
> Key: MESOS-6451
> URL: https://issues.apache.org/jira/browse/MESOS-6451
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> The proposal here is to add a timer for both Mesos Containerizer and Docker 
> containerizer to monitor latency distribution of pulling images.
> This can be used for operators who operates either containerizer in 
> production, and used for migration phase to understand performance variation 
> if any.
> I plan to use one hour look back window for this timer, unless there is other 
> concern.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6495) Create metrics for HTTP API endpoint

2016-10-27 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-6495:


 Summary: Create metrics for HTTP API endpoint
 Key: MESOS-6495
 URL: https://issues.apache.org/jira/browse/MESOS-6495
 Project: Mesos
  Issue Type: Improvement
Reporter: Zhitao Li


We should have metrics for the various response codes of the (scheduler) HTTP 
API (2xx, 4xx, etc.).

[~anandmazumdar] suggested that, ideally, the solution could be easily extended 
to cover other endpoints if we enhance libprocess directly, so that the other 
APIs (Master/Agent) are covered as well. A rough sketch follows.
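
To make this concrete, a minimal sketch of per-class counters. The metric keys, 
the struct, and where {{record()}} gets called are assumptions; the real change 
would likely go into libprocess so any endpoint can benefit:

{code}
#include <process/http.hpp>
#include <process/metrics/counter.hpp>
#include <process/metrics/metrics.hpp>

// Hypothetical sketch: count responses by status class so 2xx/4xx/5xx
// rates can be graphed per endpoint.
struct ResponseCodeMetrics
{
  ResponseCodeMetrics()
    : responses_2xx("master/http/scheduler_api/responses/2xx"),
      responses_4xx("master/http/scheduler_api/responses/4xx"),
      responses_5xx("master/http/scheduler_api/responses/5xx")
  {
    process::metrics::add(responses_2xx);
    process::metrics::add(responses_4xx);
    process::metrics::add(responses_5xx);
  }

  // Bump the counter matching the numeric status code of the response.
  void record(const process::http::Response& response)
  {
    if (response.code >= 200 && response.code < 300) {
      ++responses_2xx;
    } else if (response.code >= 400 && response.code < 500) {
      ++responses_4xx;
    } else if (response.code >= 500) {
      ++responses_5xx;
    }
  }

  process::metrics::Counter responses_2xx;
  process::metrics::Counter responses_4xx;
  process::metrics::Counter responses_5xx;
};
{code}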



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6495) Create metrics for HTTP API endpoint response codes.

2016-10-27 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-6495:
-
Summary: Create metrics for HTTP API endpoint response codes.  (was: Create 
metrics for HTTP API endpoint)

> Create metrics for HTTP API endpoint response codes.
> 
>
> Key: MESOS-6495
> URL: https://issues.apache.org/jira/browse/MESOS-6495
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>
> We should have some metrics about various response code for (scheduler) HTTP 
> API (2xx, 4xx, etc)
> [~anandmazumdar] suggested that ideally the solution could be easily extended 
> to cover other endpoints if we can directly enhance libprocess, so we can 
> cover other API (Master/Agent).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6499) Add metric to track active subscribers in operator API

2016-10-28 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-6499:


 Summary: Add metric to track active subscribers in operator API
 Key: MESOS-6499
 URL: https://issues.apache.org/jira/browse/MESOS-6499
 Project: Mesos
  Issue Type: Improvement
  Components: HTTP API
Reporter: Zhitao Li
Assignee: Zhitao Li






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3574) Support replacing ZooKeeper with replicated log

2016-10-28 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616044#comment-15616044
 ] 

Zhitao Li commented on MESOS-3574:
--

How would frameworks and agents detect where the leading master is when using 
the replicated log? Are clients expected to hard-code a list of master ip:port 
pairs and rely on the redirect message from the master?

> Support replacing ZooKeeper with replicated log
> ---
>
> Key: MESOS-3574
> URL: https://issues.apache.org/jira/browse/MESOS-3574
> Project: Mesos
>  Issue Type: Improvement
>  Components: leader election, replicated log
>Reporter: Neil Conway
>  Labels: mesosphere
>
> It would be useful to support using the replicated log without also requiring 
> ZooKeeper to be running. This would simplify the process of 
> configuring/operating a high-availability configuration of Mesos.
> At least three things would need to be done:
> 1. Abstract away the stuff we use Zk for into an interface that can be 
> implemented (e.g., by etcd, consul, rep-log, or Zk). This might be done 
> already as part of [MESOS-1806]
> 2. Enhance the replicated log to be able to do its own leader election + 
> failure detection (to decide when the current master is down).
> 3. Validate replicated log performance to ensure it is adequate (per Joris, 
> likely needs some significant work)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6457) Tasks shouldn't transition from TASK_KILLING to TASK_RUNNING.

2016-10-31 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15622713#comment-15622713
 ] 

Zhitao Li commented on MESOS-6457:
--

Is this behavior only possible when a framework opts in to Mesos's own health 
checks (i.e., custom executors that do not use Mesos health checks should not 
be affected)?

> Tasks shouldn't transition from TASK_KILLING to TASK_RUNNING.
> -
>
> Key: MESOS-6457
> URL: https://issues.apache.org/jira/browse/MESOS-6457
> Project: Mesos
>  Issue Type: Bug
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>Priority: Blocker
>
> A task can currently transition from {{TASK_KILLING}} to {{TASK_RUNNING}}, if 
> for example it starts/stops passing a health check once it got into the 
> {{TASK_KILLING}} state.
> I think that this behaviour is counterintuitive. It also makes the life of 
> framework/tools developers harder, since they have to keep track of the 
> complete task status history in order to know if a task is being killed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6554) Create event stream capability in agent API

2016-11-04 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-6554:


 Summary: Create event stream capability in agent API
 Key: MESOS-6554
 URL: https://issues.apache.org/jira/browse/MESOS-6554
 Project: Mesos
  Issue Type: Wish
  Components: HTTP API
Reporter: Zhitao Li


Similar to the event stream API in the master, I hope we can have the same 
capability in the agent API.

Many container-related integration projects use APIs like 
[docker events|https://docs.docker.com/engine/reference/api/docker_remote_api_v1.24/#/monitor-dockers-events],
 and people need an equivalent if they want to use the Mesos containerizer to 
run docker containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-6162) Add support for cgroups blkio subsystem

2016-11-09 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li reassigned MESOS-6162:


Assignee: Zhitao Li

> Add support for cgroups blkio subsystem
> ---
>
> Key: MESOS-6162
> URL: https://issues.apache.org/jira/browse/MESOS-6162
> Project: Mesos
>  Issue Type: Task
>Reporter: haosdent
>Assignee: Zhitao Li
>
> Noted that cgroups blkio subsystem may have performance issue, refer to 
> https://github.com/opencontainers/runc/issues/861



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (MESOS-4945) Garbage collect unused docker layers in the store.

2016-12-01 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-4945:
-
Comment: was deleted

(was: Revised plan in rough steps:
* For each image, checkpoint a) container ids, b) time of last container using 
it being destroyed, and c) size of each layer;
** TODO: how do deal with migration? idea is passing in more info in 
recover() chain of containerizer -> provisioner -> store;
* Change store interface:
** "get(Image)" to "get(Image, ContainerID)": The containerID field added 
can be used to implement ref counting and further book keeping (i.e. get local 
images information);
** add "remove(Image, ContainerID)" virtual function: this is optional in 
that store which does not do ref counting can have an empty implementation.
*  Make sure provisioner::destroy() call store::remove(Image, ContainerID);
* Add command line flag for docker store capacity limit (in bytes);
* In (docker) store::get(Image, ContainerID), after a pull is done, calculate 
total layer sizes, if above store capacity, remove unused images (determined by 
empty container ids), sorted by last time not used. Any layer not shared by 
leftover images is also removed, until total size is dropped below capacity.

Open question: 

1) In this design, we have one explicit reference counting between 
{{Container}} and {{Image}} in store. However, this information could be 
constructed on-the-fly with all containers in {{Containerizer}} class. Do we 
consider this "double accounting" problematic, or error-prone?
2) Is calling new {{remove(Image, ContainerID)}} from 
{{Provisioner::destroy()}} sufficient to make sure all book keepings are 
properly done?)

> Garbage collect unused docker layers in the store.
> --
>
> Key: MESOS-4945
> URL: https://issues.apache.org/jira/browse/MESOS-4945
> Project: Mesos
>  Issue Type: Epic
>Reporter: Jie Yu
>Assignee: Zhitao Li
>
> Right now, we don't have any garbage collection in place for docker layers. 
> It's not straightforward to implement because we don't know what container is 
> currently using the layer. We probably need a way to track the current usage 
> of layers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4945) Garbage collect unused docker layers in the store.

2016-12-01 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15714038#comment-15714038
 ] 

Zhitao Li commented on MESOS-4945:
--

[~gilbert] [~jieyu], I've put up a short design doc for this in 
https://docs.google.com/document/d/1TSn7HOFLWpF3TLRVe4XyLpv6B__A1tk-tU16B1ZbsCI/edit#.

Please take a look and let me know if you see issues.

If it looks good, I'll add more issues to this epic.

> Garbage collect unused docker layers in the store.
> --
>
> Key: MESOS-4945
> URL: https://issues.apache.org/jira/browse/MESOS-4945
> Project: Mesos
>  Issue Type: Epic
>Reporter: Jie Yu
>Assignee: Zhitao Li
>
> Right now, we don't have any garbage collection in place for docker layers. 
> It's not straightforward to implement because we don't know what container is 
> currently using the layer. We probably need a way to track the current usage 
> of layers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-6495) Create metrics for HTTP API endpoint response codes.

2016-12-02 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li reassigned MESOS-6495:


Assignee: Zhitao Li

> Create metrics for HTTP API endpoint response codes.
> 
>
> Key: MESOS-6495
> URL: https://issues.apache.org/jira/browse/MESOS-6495
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> We should have some metrics about various response code for (scheduler) HTTP 
> API (2xx, 4xx, etc)
> [~anandmazumdar] suggested that ideally the solution could be easily extended 
> to cover other endpoints if we can directly enhance libprocess, so we can 
> cover other API (Master/Agent).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6495) Create metrics for HTTP API endpoint response codes.

2016-12-02 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-6495:
-
Shepherd: Anand Mazumdar

> Create metrics for HTTP API endpoint response codes.
> 
>
> Key: MESOS-6495
> URL: https://issues.apache.org/jira/browse/MESOS-6495
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> We should have some metrics about various response code for (scheduler) HTTP 
> API (2xx, 4xx, etc)
> [~anandmazumdar] suggested that ideally the solution could be easily extended 
> to cover other endpoints if we can directly enhance libprocess, so we can 
> cover other API (Master/Agent).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6082) Add scheduler Call and Event based metrics to the master.

2016-12-02 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15715838#comment-15715838
 ] 

Zhitao Li commented on MESOS-6082:
--

Hi [~a10gupta], we are hoping to get this done soon, ideally in time for the 
Mesos 1.2 release (scheduled for some time in Jan 2017). Are you actively 
working on this? If not, could you unclaim it so I can take a pass?

Thanks!

> Add scheduler Call and Event based metrics to the master.
> -
>
> Key: MESOS-6082
> URL: https://issues.apache.org/jira/browse/MESOS-6082
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Benjamin Mahler
>Assignee: Abhishek Dasgupta
>
> Currently, the master only has metrics for the old-style messages and these 
> are re-used for calls unfortunately:
> {code}
>   // Messages from schedulers.
>   process::metrics::Counter messages_register_framework;
>   process::metrics::Counter messages_reregister_framework;
>   process::metrics::Counter messages_unregister_framework;
>   process::metrics::Counter messages_deactivate_framework;
>   process::metrics::Counter messages_kill_task;
>   process::metrics::Counter messages_status_update_acknowledgement;
>   process::metrics::Counter messages_resource_request;
>   process::metrics::Counter messages_launch_tasks;
>   process::metrics::Counter messages_decline_offers;
>   process::metrics::Counter messages_revive_offers;
>   process::metrics::Counter messages_suppress_offers;
>   process::metrics::Counter messages_reconcile_tasks;
>   process::metrics::Counter messages_framework_to_executor;
> {code}
> Now that we've introduced the Call/Event based API, we should have metrics 
> that reflect this. For example:
> {code}
> {
>   scheduler/calls: 100
>   scheduler/calls/decline: 90,
>   scheduler/calls/accept: 10,
>   scheduler/calls/accept/operations/create: 1,
>   scheduler/calls/accept/operations/destroy: 0,
>   scheduler/calls/accept/operations/launch: 4,
>   scheduler/calls/accept/operations/launch_group: 2,
>   scheduler/calls/accept/operations/reserve: 1,
>   scheduler/calls/accept/operations/unreserve: 0,
>   scheduler/calls/kill: 0,
>   // etc
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1280) Add replace task primitive

2016-12-16 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15755230#comment-15755230
 ] 

Zhitao Li commented on MESOS-1280:
--

Hi, is there any broader interest in pursuing this within the next 1-2 Mesos 
release cycles? Our organization is quite interested in adding this capability 
for a couple of reasons, and we would be happy if a committer is willing to 
shepherd us.

Thanks!

> Add replace task primitive
> --
>
> Key: MESOS-1280
> URL: https://issues.apache.org/jira/browse/MESOS-1280
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, c++ api, master
>Reporter: Niklas Quarfot Nielsen
>  Labels: mesosphere
>
> Also along the lines of MESOS-938, replaceTask would one of a couple of 
> primitives needed to support various task replacement and scaling scenarios. 
> This replaceTask() version is significantly simpler than the first proposed 
> one; it's only responsibility is to run a new task info on a running tasks 
> resources.
> The running task will be killed as usual, but the newly freed resources will 
> never be announced and the new task will run on them instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6808) Refactor Docker::run to only take docker cli parameters

2016-12-16 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-6808:


 Summary: Refactor Docker::run to only take docker cli parameters
 Key: MESOS-6808
 URL: https://issues.apache.org/jira/browse/MESOS-6808
 Project: Mesos
  Issue Type: Task
  Components: docker
Reporter: Zhitao Li
Assignee: Zhitao Li
Priority: Minor


As we discussed, {{Docker::run}} in src/docker/docker.hpp should only 
understand docker cli options. The logic that creates these options should be 
refactored into a separate helper function (a rough sketch follows).

This will also let us work around GMOCK's maximum of 10 arguments for mocked 
methods.
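
A rough sketch of the shape of the refactor; the struct name, its members and 
the commented-out mock signature are assumptions, not the final API:

{code}
#include <string>
#include <vector>

#include <stout/bytes.hpp>
#include <stout/option.hpp>

// Hypothetical sketch: a plain options struct that maps one-to-one onto
// docker cli flags. A separate helper (not shown) would translate
// TaskInfo/ContainerInfo/CommandInfo into this struct, and Docker::run()
// would take only the struct.
struct RunOptions
{
  Option<std::string> name;          // --name
  Option<std::string> network;       // --net
  Option<double> cpuShares;          // --cpu-shares
  Option<Bytes> memory;              // --memory
  std::vector<std::string> env;      // -e KEY=VALUE
  std::vector<std::string> volumes;  // -v host:container[:mode]
  std::string image;                 // image to run
  Option<std::string> command;       // command override
  std::vector<std::string> extra;    // any additional raw cli arguments
};

// The mocked signature then shrinks to something like:
//   MOCK_METHOD3(run, process::Future<Option<int>>(
//       const RunOptions& options,
//       const process::Subprocess::IO& stdout,
//       const process::Subprocess::IO& stderr));
{code}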



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6831) Add metrics for `slave` libprocess' event queue

2016-12-21 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-6831:


 Summary: Add metrics for `slave` libprocess' event queue
 Key: MESOS-6831
 URL: https://issues.apache.org/jira/browse/MESOS-6831
 Project: Mesos
  Issue Type: Improvement
  Components: agent
Reporter: Zhitao Li


We have event queue metrics for the master and the allocator in 
http://mesos.apache.org/documentation/latest/monitoring/, but we don't expose 
the event queue length of the agent's most important libprocess actor, `slave`.

I propose we add a similar metric for this actor (a rough sketch follows). It 
is useful, at minimum, for debugging whether the Mesos agent is overloaded.
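
A minimal sketch of what the gauge could look like, mirroring the master's 
{{master/event_queue_messages}} metric; the metric key, the process name and 
where it is registered are assumptions (in the real change this would live 
inside the existing `slave` actor):

{code}
#include <process/defer.hpp>
#include <process/future.hpp>
#include <process/metrics/gauge.hpp>
#include <process/metrics/metrics.hpp>
#include <process/process.hpp>

// Hypothetical: a standalone process is used here only for illustration.
class SlaveQueueMetricsProcess
  : public process::Process<SlaveQueueMetricsProcess>
{
public:
  SlaveQueueMetricsProcess()
    : process::ProcessBase("slave-queue-metrics"),
      event_queue_messages(
          "slave/event_queue_messages",
          process::defer(
              self(), &SlaveQueueMetricsProcess::_event_queue_messages))
  {
    process::metrics::add(event_queue_messages);
  }

  ~SlaveQueueMetricsProcess() override
  {
    process::metrics::remove(event_queue_messages);
  }

private:
  // Reports the number of pending message events in this actor's queue;
  // the real implementation would report the `slave` actor's queue size.
  process::Future<double> _event_queue_messages()
  {
    return static_cast<double>(eventCount<process::MessageEvent>());
  }

  process::metrics::Gauge event_queue_messages;
};
{code}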



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

