[jira] [Created] (MESOS-9207) CFS on docker executor tasks doesn't work

2018-09-04 Thread Martin Bydzovsky (JIRA)
Martin Bydzovsky created MESOS-9207:
---

 Summary: CFS on docker executor tasks doesn't work
 Key: MESOS-9207
 URL: https://issues.apache.org/jira/browse/MESOS-9207
 Project: Mesos
  Issue Type: Bug
  Components: containerization, docker
Affects Versions: 1.6.1, 1.5.1
Reporter: Martin Bydzovsky


CFS hard limiting on docker-based tasks doesn't work. The --cgroups-enable-cfs support added in 
[https://github.com/apache/mesos/commit/346cc8dd528a28a6e1f1cbdb4c95b8bdea2f6070]
makes the executor pass the --cpu-quota parameter, which is nice but, on its own, completely 
useless: hard limiting only takes effect when either --cpus or --cpu-period is set, with 
--cpu-quota (optionally) overriding the default quota. 
(https://docs.docker.com/config/containers/resource_constraints/#configure-the-default-cfs-scheduler)
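
For illustration, here is a minimal sketch (Python, nothing taken from the Mesos sources; the 
100000-microsecond period is just Docker's documented default) of the flag combination described above:
{code:python}
# Sketch only: illustrates the docker-run flag combination described above.
# Assumes Docker's default CFS period of 100000 microseconds (100 ms).

DEFAULT_CFS_PERIOD_US = 100000

def cfs_flags(cpus):
    """Flags that would hard-limit a task to `cpus` CPUs via CFS."""
    quota_us = int(cpus * DEFAULT_CFS_PERIOD_US)
    # Passing only --cpu-quota (what the executor currently does) is the
    # behaviour this report complains about; the period/cpus part is missing.
    return ["--cpu-period", str(DEFAULT_CFS_PERIOD_US),
            "--cpu-quota", str(quota_us)]

print(" ".join(cfs_flags(1.0)))  # --cpu-period 100000 --cpu-quota 100000
{code}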

 

Attaching output showing the wrong parameters being passed by the executor:

 
{code:java}
bydga@bydzovskym ~ λ curl http://mesos-slave1:5051/flags | jshon | grep cfs
 "cgroups_enable_cfs": "true",
bydga@bydzovskym ~ λ ssh mesos-slave1
Welcome to Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-1060-aws x86_64)
bydzovskym mesos-slave1:us-w2 ~  ps aux | grep example-api
root 30414 0.1 0.3 843532 49296 ? Ssl 07:54 0:01 mesos-docker-executor 
--cgroups_enable_cfs=true 
--container=mesos-6e31d2cb-ac4f-4b1c-ae2b-08cf54acc088 --docker=docker 
--docker_socket=/var/run/docker.sock --help=false 
--initialize_driver_logging=true --launcher_dir=/usr/libexec/mesos 
--logbufsecs=0 --logging_level=INFO --mapped_directory=/mnt/mesos/sandbox 
--quiet=false 
--sandbox_directory=/srv/mesos/slaves/6b8f88fb-29df-4a35-86c3-a369d1447a53-S0/frameworks/2da5f61c-8400-40e0-8964-3edbd2f24e37-0001/executors/hera_example-api_production_api.b4ff812e-b017-11e8-92cc-06cd01d45cce/runs/6e31d2cb-ac4f-4b1c-ae2b-08cf54acc088
 --stop_timeout=30secs
root 30426 0.0 0.1 324744 26644 ? Sl 07:54 0:00 docker -H 
unix:///var/run/docker.sock run --cpu-shares 1024 --cpu-quota 10 --memory 
209715200 -e HOST=mesos-slave1.priv -e 
MARATHON_APP_DOCKER_IMAGE=awsid.dkr.ecr.us-west-2.amazonaws.com/hera/example-api/production:d07dd097
 -e MARATHON_APP_ID=/hera/example-api/production/api -e 
MARATHON_APP_RESOURCE_CPUS=1.0 -e MARATHON_APP_RESOURCE_DISK=0.0 -e 
MARATHON_APP_RESOURCE_GPUS=0 -e MARATHON_APP_RESOURCE_MEM=200.0 -e 
MARATHON_APP_VERSION=2018-09-04T07:54:09.419Z -e 
MESOS_CONTAINER_NAME=mesos-6e31d2cb-ac4f-4b1c-ae2b-08cf54acc088 -e 
MESOS_SANDBOX=/mnt/mesos/sandbox -e 
MESOS_TASK_ID=hera_example-api_production_api.b4ff812e-b017-11e8-92cc-06cd01d45cce
 -e PORT=9115 -e PORT0=9115 -e PORTS=9115 -e PORT_9115=9115 -e PORT_PORT0=9115 
-v 
/srv/mesos/slaves/6b8f88fb-29df-4a35-86c3-a369d1447a53-S0/frameworks/2da5f61c-8400-40e0-8964-3edbd2f24e37-0001/executors/hera_example-api_production_api.b4ff812e-b017-11e8-92cc-06cd01d45cce/runs/6e31d2cb-ac4f-4b1c-ae2b-08cf54acc088:/mnt/mesos/sandbox
 --net bridge -p 9115:9115/tcp --name 
mesos-6e31d2cb-ac4f-4b1c-ae2b-08cf54acc088 
--label=MESOS_TASK_ID=hera_example-api_production_api.b4ff812e-b017-11e8-92cc-06cd01d45cce
 awsid.dkr.ecr.us-west-2.amazonaws.com/hera/example-api/production:d07dd097 
coffee index.coffee{code}
 

You can see that the mesos-docker-executor has correctly received
{code:java}
--cgroups_enable_cfs=true{code}
However, only
{code:java}
--cpu-shares 1024 --cpu-quota 10{code}
end up in the docker run command - with no --cpu-period or --cpus at all.
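
To double-check what the kernel is actually enforcing, one can read the container's CPU cgroup 
directly. A minimal sketch, assuming the cgroupfs driver and the cgroup v1 layout of this 
Ubuntu 16.04 host (the container id below is a placeholder):
{code:python}
# Sketch: print the CFS settings currently enforced for one container.
# Paths assume cgroup v1 with Docker's cgroupfs driver.
import os

CGROUP_CPU_ROOT = "/sys/fs/cgroup/cpu/docker"
cid = "<docker-container-id>"  # placeholder

def read_int(name):
    with open(os.path.join(CGROUP_CPU_ROOT, cid, name)) as f:
        return int(f.read())

quota = read_int("cpu.cfs_quota_us")    # -1 means no hard limit at all
period = read_int("cpu.cfs_period_us")  # kernel default is 100000 (100 ms)

if quota == -1:
    print("CFS hard limiting is NOT active for this container")
else:
    print("hard limit: %.2f CPUs (quota=%d us, period=%d us)"
          % (quota / float(period), quota, period))
{code}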



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8982) add cgroup memory.max_usage_in_bytes into slave monitor/statistics endpoint

2018-06-06 Thread Martin Bydzovsky (JIRA)
Martin Bydzovsky created MESOS-8982:
---

 Summary: add cgroup memory.max_usage_in_bytes into slave 
monitor/statistics endpoint
 Key: MESOS-8982
 URL: https://issues.apache.org/jira/browse/MESOS-8982
 Project: Mesos
  Issue Type: Improvement
  Components: cgroups, docker, HTTP API
Affects Versions: 1.6.0
Reporter: Martin Bydzovsky


As an operator, I'm periodically checking the slave's monitor/statistics endpoint to get the 
memory/CPU usage and CPU throttling for each running task. However, if there is a short-term 
memory usage peak (let's say seconds), I might miss it: the memory might have been allocated 
and released again between two of my metric-collection intervals. Since the maximum used memory 
is recorded in `/sys/fs/cgroup/memory/docker/CID/memory.max_usage_in_bytes`, it would be great 
if this information were exposed in the API as well.
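
Until something like this is exposed by the agent, a rough stopgap is to read the peak straight 
from the cgroup on the host; a minimal sketch, assuming cgroup v1 and Docker's cgroupfs driver 
(the container id is a placeholder):
{code:python}
# Sketch: read (and optionally reset) a container's peak memory usage from
# the cgroup v1 memory controller.

def peak_memory_bytes(cid, reset=False):
    path = "/sys/fs/cgroup/memory/docker/%s/memory.max_usage_in_bytes" % cid
    with open(path) as f:
        peak = int(f.read())
    if reset:
        # Writing 0 resets the high-water mark, so the next read only covers
        # the interval since this call.
        with open(path, "w") as f:
            f.write("0")
    return peak

print(peak_memory_bytes("<docker-container-id>"))
{code}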



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7522) Mesos containerizer to support docker credential helpers for private docker registries

2017-10-31 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226466#comment-16226466
 ] 

Martin Bydzovsky commented on MESOS-7522:
-

+1 for this. Specifying credentials for pulling an image as `credential principal+secret` in the 
Mesos containerizer is a no-go for AWS ECR: it issues you a token (via `aws ecr get-login`) that 
is valid for something like 12 hours, and then you need to obtain a new one. Or is there a 
workaround for this?
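
For context, this is roughly what a refresh loop has to do today; a minimal sketch using boto3, 
just to illustrate that the credential itself expires and therefore cannot be baked into a static 
principal+secret (nothing here is a Mesos API):
{code:python}
# Sketch: fetch a short-lived ECR auth token (valid ~12h) and shape it into
# the "auths" entry a .docker/config.json would need. Requires boto3 and AWS
# credentials in the environment.
import json
import boto3

def ecr_docker_auths():
    data = boto3.client("ecr").get_authorization_token()["authorizationData"][0]
    # authorizationToken is already base64("AWS:<password>"), which is exactly
    # the format of the "auth" field in .docker/config.json.
    auths = {"auths": {data["proxyEndpoint"]: {"auth": data["authorizationToken"]}}}
    return auths, data["expiresAt"]

auths, expires = ecr_docker_auths()
print(json.dumps(auths))
print("expires at %s" % expires)  # something has to rerun this before expiry
{code}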

> Mesos containerizer to support docker credential helpers for private docker 
> registries
> --
>
> Key: MESOS-7522
> URL: https://issues.apache.org/jira/browse/MESOS-7522
> Project: Mesos
>  Issue Type: Wish
>  Components: containerization
>Reporter: Mao Geng
>Assignee: Mao Geng
>  Labels: mesos-containerizer
>
> In Pinterest, we use Amazon ECR as our docker registry and use 
> https://github.com/awslabs/amazon-ecr-credential-helper to let docker engine 
> to get auth token automatically. 
> It works well with docker containerizer, as long as I have the 
> .docker/config.json configured "credStores" and --docker_config configured 
> for mesos-agent. 
> However, this doesn't work for mesos containerizer. Meanwhile we want to use 
> mesos containerizer's GPU support, so we have to run a separate docker 
> registry on http and without auth, purely for mesos containerizer. 
> I think it will be good if mesos containerizer can support 
> https://github.com/docker/docker-credential-helpers by default, so that it 
> will address a pain point for the users who are using crendential helpers 
> with private registries on ECR, GCR, quay, dockerhub etc. 
> This might be related to MESOS-7088
> CC [~jieyu] [~gilbert]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-4279) Docker executor truncates task's output when the task is killed.

2016-06-07 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318411#comment-15318411
 ] 

Martin Bydzovsky commented on MESOS-4279:
-

Hi Ben, nice to hear that. Just to be sure: https://reviews.apache.org/r/46892/ 
- this is the RB issue. 

I wanted to give it one last try along the lines mentioned in the RB, but yesterday I pulled the 
Mesos sources, compiled them, and I'm unable to run Mesos 1.0.0 with Marathon 1.1.1 as I keep 
getting "Mesos JAR version 0.28.0 is not backwards compatible with Mesos native library version 
1.0.0". 

So I guess now it's up to you. The solution is pointed out in my GitHub branch; maybe you just 
need to make it a bit nicer :)



> Docker executor truncates task's output when the task is killed.
> 
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.2, 0.28.1
>Reporter: Martin Bydzovsky
>Assignee: Benjamin Mahler
>Priority: Critical
>  Labels: docker, mesosphere
> Fix For: 1.0.0
>
>
> I'm implementing a graceful restarts of our mesos-marathon-docker setup and I 
> came to a following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and guys form mesosphere 
> got to a point that its probably a docker containerizer problem...)
> To sum it up:
> When i deploy simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get expected result - the task receives sigterm and 
> dies peacefully (during my script-specified 2 seconds period)
> But when i wrap this python script in a docker:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run appropriate application by Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: yes
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> The task during restart (issued from marathon) dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4279) Graceful restart of docker task

2016-04-20 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249489#comment-15249489
 ] 

Martin Bydzovsky edited comment on MESOS-4279 at 4/20/16 8:31 AM:
--

Fine, so I will prepare the RB issues. Should I make two separate ones? Or just one fixing both 
problems? Or just the one fixing the corrupted stdout/err streams? Because in the second one 
there's the philosophical question of marking the task as KILLED vs FINISHED when it ends during 
the grace period (even though non-docker tasks end with FINISHED).. :) [~alexr] 
[~haosd...@gmail.com]?


was (Author: bydga):
Fine, so I will prepare the RB issues. Should I make two separate? Or just one 
fixing both problems? Or just the one fixing the corrupted stdout/err streams? 
Because in the second one, theres the philosophical issue about marking task as 
KILLED vs FINISHED when it end during the grace period (even though non-docker 
tasks are ending with FINISHED).. :) [~alexr] [~haosd...@gmail.com]?

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.2
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>  Labels: docker, mesosphere
>
> I'm implementing a graceful restarts of our mesos-marathon-docker setup and I 
> came to a following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and guys form mesosphere 
> got to a point that its probably a docker containerizer problem...)
> To sum it up:
> When i deploy simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get expected result - the task receives sigterm and 
> dies peacefully (during my script-specified 2 seconds period)
> But when i wrap this python script in a docker:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run appropriate application by Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: yes
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> The task during restart (issued from marathon) dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-04-20 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249489#comment-15249489
 ] 

Martin Bydzovsky commented on MESOS-4279:
-

Fine, so I will prepare the RB issues. Should I make two separate ones? Or just one fixing both 
problems? Or just the one fixing the corrupted stdout/err streams? Because in the second one 
there's the philosophical question of marking the task as KILLED vs FINISHED when it ends during 
the grace period (even though non-docker tasks end with FINISHED).. :) [~alexr] 
[~haosd...@gmail.com]?

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.2
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>  Labels: docker, mesosphere
>
> I'm implementing a graceful restarts of our mesos-marathon-docker setup and I 
> came to a following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and guys form mesosphere 
> got to a point that its probably a docker containerizer problem...)
> To sum it up:
> When i deploy simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get expected result - the task receives sigterm and 
> dies peacefully (during my script-specified 2 seconds period)
> But when i wrap this python script in a docker:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run appropriate application by Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: yes
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> The task during restart (issued from marathon) dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-04-18 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245784#comment-15245784
 ] 

Martin Bydzovsky commented on MESOS-4279:
-

It's usually not in my job description to make videos... :) But I knew I should have added the 
Mission Impossible theme song to the background to make it more dramatic and attractive!

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.2
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>  Labels: docker, mesosphere
>
> I'm implementing a graceful restarts of our mesos-marathon-docker setup and I 
> came to a following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and guys form mesosphere 
> got to a point that its probably a docker containerizer problem...)
> To sum it up:
> When i deploy simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get expected result - the task receives sigterm and 
> dies peacefully (during my script-specified 2 seconds period)
> But when i wrap this python script in a docker:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run appropriate application by Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: yes
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> The task during restart (issued from marathon) dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-04-18 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245684#comment-15245684
 ] 

Martin Bydzovsky commented on MESOS-4279:
-

Just to make it transparent: I've exchanged a few emails with Alex to clarify and summarize the 
issues. Today I confirmed that both issues still persist on mesos 0.29 + marathon 1.1.1. Here's a 
video demonstrating both of them: https://www.youtube.com/watch?v=vDUA9_ASYW0. 

Btw, don't forget about my GitHub branch - I've already fixed both problems there... ;)

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.2
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>  Labels: docker, mesosphere
>
> I'm implementing a graceful restarts of our mesos-marathon-docker setup and I 
> came to a following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and guys form mesosphere 
> got to a point that its probably a docker containerizer problem...)
> To sum it up:
> When i deploy simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get expected result - the task receives sigterm and 
> dies peacefully (during my script-specified 2 seconds period)
> But when i wrap this python script in a docker:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run appropriate application by Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: yes
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> The task during restart (issued from marathon) dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-04-15 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15242587#comment-15242587
 ] 

Martin Bydzovsky commented on MESOS-4279:
-

I'm happy I'm not the only one noticing this. Well, to be honest, I absolutely gave up on solving 
this issue. I reported the bug here, I even resolved the issues (there were more problems, 
actually) in my branch on GitHub, I spoke with some guy (from Mesos) on IRC, I was asking on 
Slack - without any response. And no one cares. In the meantime, they do something like 
https://issues.apache.org/jira/browse/MESOS-4909 - but with the same errors again... So right now 
we are considering writing our own {{Executor}}.

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>
> I'm implementing a graceful restarts of our mesos-marathon-docker setup and I 
> came to a following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and guys form mesosphere 
> got to a point that its probably a docker containerizer problem...)
> To sum it up:
> When i deploy simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get expected result - the task receives sigterm and 
> dies peacefully (during my script-specified 2 seconds period)
> But when i wrap this python script in a docker:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run appropriate application by Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: yes
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> The task during restart (issued from marathon) dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-03-09 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186839#comment-15186839
 ] 

Martin Bydzovsky commented on MESOS-4279:
-

Hi again [~qianzhang]. So I finally managed to solve the issue. There are actually 2 bugs in 
docker/executor.cpp:

1) The first is related to my findings about the error {{attach: stdout: write unix @: broken 
pipe}} - which I mentioned 2 months ago in comment 
https://issues.apache.org/jira/browse/MESOS-4279?focusedCommentId=15091797=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15091797:
You are calling the {{run->discard}} method (which closes the stderr/stdout streams) too early - 
during the "stopping period" the container can (and usually will) write something about its 
termination, but the stream is already invalid. That's why the docker daemon complains, and that's 
why we haven't seen the messages from my script (got 15... ending... Goodbye...). Here's the 
commit that fixes it:
https://github.com/bydga/mesos/commit/73d1b3dc8605cf51163619c9e05c88666d926951


2) The second is about the actual wrong TASK_KILLED state:
Basically, you are always setting the {{killed=true}} flag 
https://github.com/apache/mesos/blob/master/src/docker/executor.cpp#L227 - 
which is not necessarily true. Plus, this check 
https://github.com/apache/mesos/blob/master/src/docker/executor.cpp#L300 
is also wrong - a task counts as failed when "it's not ready"? So here is the commit that fixes 
it too:
https://github.com/bydga/mesos/commit/acf79781c04ad9309083dc39131e2c8305331431

I'm not a C++ developer, so the code might not be the best - however, it finally works.
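
To make the intent of those two commits easier to follow, here is the state-selection logic I'm 
aiming for as a small Python sketch (names and the exact mapping are mine, not the actual C++; 
the KILLED-vs-FINISHED choice is exactly the question discussed elsewhere in this ticket):
{code:python}
# Sketch of the intended behaviour, not the real executor code. The other half
# of the fix (not shown) is simply to keep the container's stdout/stderr
# attached until the container has been reaped, so output written during the
# grace period still lands in the sandbox logs.
from collections import namedtuple

Reaped = namedtuple("Reaped", "exit_code killed_by_signal")

def terminal_state(kill_requested, reaped):
    """Pick the terminal task state from how the container really ended."""
    if reaped.exit_code == 0:
        # A clean exit during the grace period stays FINISHED, even though a
        # kill had been requested.
        return "TASK_FINISHED"
    if kill_requested or reaped.killed_by_signal:
        return "TASK_KILLED"
    return "TASK_FAILED"

# docker stop sent SIGTERM and the script cleaned up and exited 0:
print(terminal_state(True, Reaped(exit_code=0, killed_by_signal=False)))
# the grace period expired and the container got SIGKILLed:
print(terminal_state(True, Reaped(exit_code=137, killed_by_signal=True)))
{code}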
So, what now: Do you want me to create one pull request that solves both? Or 2 pull requests, one 
per issue? They actually touch the same lines of code, so 2 separate merge requests might lead to 
conflicts... Can you process this? I don't want to create a new issue and go through the same 
months-long procedure again.

Btw, I have created the commits against the 0.26.0 tag - because we were talking about this 
version. Should I rebase them onto master? Is there a chance they will get merged into the 
current WIP 0.28?

Thanks for the quick answers.

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>
> I'm implementing a graceful restarts of our mesos-marathon-docker setup and I 
> came to a following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and guys form mesosphere 
> got to a point that its probably a docker containerizer problem...)
> To sum it up:
> When i deploy simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get expected result - the task receives sigterm and 
> dies peacefully (during my script-specified 2 seconds period)
> But when i wrap this python script in a docker:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run appropriate application by Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: yes
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> The task during restart (issued from marathon) dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-03-07 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183205#comment-15183205
 ] 

Martin Bydzovsky commented on MESOS-4279:
-

Are you sure, [~qianzhang], that you tried exactly {{vagrant up}} and then restarted the app (via 
the Marathon API/UI)? Because I've now started digging, adding custom logs to the Mesos codebase 
and recompiling it over and over. And to me, the code looks like it has never ever worked.

https://github.com/apache/mesos/blob/0.26.0/src/docker/executor.cpp#L219 - Immediately after 
calling docker->stop (with the correct value btw, as I've inspected), you set {{killed=true}}, 
and then in the {{reaped}} method (which gets called right away) you check the {{killed}} flag 
and send the wrong TASK_KILLED status update: 
https://github.com/apache/mesos/blob/0.26.0/src/docker/executor.cpp#L281. 

Finally, 
https://github.com/apache/mesos/blob/0.26.0/src/docker/executor.cpp#L308 stops 
the whole driver - I'm not sure yet what that really means, but if that's the parent process of 
the docker executor, then it will kill the {{docker run}} process in a cascade.
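
Condensed into a few lines of runnable Python pseudologic (this is my reading of the flow, with a 
stand-in class; it is not Mesos code):
{code:python}
# My reading of the current docker/executor.cpp (0.26.0) flow, condensed.

class FakeDockerExecutor(object):
    def __init__(self):
        self.killed = False

    def kill_task(self):
        print("docker stop -t 10 <container>")  # SIGTERM now, SIGKILL after 10s
        self.killed = True                      # executor.cpp#L219: set unconditionally

    def reaped(self, exit_code):
        # executor.cpp#L281: the exit code is never consulted, so even a clean
        # exit during the grace period gets reported as TASK_KILLED.
        state = "TASK_KILLED" if self.killed else "TASK_FINISHED"
        print("status update: %s (exit code %d)" % (state, exit_code))
        # executor.cpp#L308: driver.stop() then tears the executor down.

e = FakeDockerExecutor()
e.kill_task()
e.reaped(0)  # -> TASK_KILLED, although the script exited cleanly
{code}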

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>
> I'm implementing a graceful restarts of our mesos-marathon-docker setup and I 
> came to a following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and guys form mesosphere 
> got to a point that its probably a docker containerizer problem...)
> To sum it up:
> When i deploy simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get expected result - the task receives sigterm and 
> dies peacefully (during my script-specified 2 seconds period)
> But when i wrap this python script in a docker:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run appropriate application by Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: yes
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> The task during restart (issued from marathon) dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-02-12 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15144279#comment-15144279
 ] 

Martin Bydzovsky commented on MESOS-4279:
-

Hello [~qianzhang], did you have time to check on this Vagrantfile?

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>
> I'm implementing a graceful restarts of our mesos-marathon-docker setup and I 
> came to a following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and guys form mesosphere 
> got to a point that its probably a docker containerizer problem...)
> To sum it up:
> When i deploy simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get expected result - the task receives sigterm and 
> dies peacefully (during my script-specified 2 seconds period)
> But when i wrap this python script in a docker:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run appropriate application by Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: yes
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> The task during restart (issued from marathon) dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-02-02 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128200#comment-15128200
 ] 

Martin Bydzovsky commented on MESOS-4279:
-

Hi Qian again,

sorry, I was busy with other things, so I only now got back to this. I created a Vagrantfile to 
reproduce the issue: 
https://gist.github.com/bydga/34df92f67ae03ca2ec4d

All you need to do is run {{vagrant up}} in a folder with this file. Then you can go for a coffee 
(or maybe ten coffees), as the Mesos compilation takes ages..

After that you should end up with a running VM with ZooKeeper, mesos-master, mesos-slave and 
Marathon running, and also with 2 apps running in Mesos (standalone and dockerized). Then you can 
navigate to http://192.168.50.4:8080 and restart the apps (via the Marathon UI). The standalone 
one dies peacefully while the dockerized one gets killed violently.

 

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>
> I'm implementing a graceful restarts of our mesos-marathon-docker setup and I 
> came to a following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and guys form mesosphere 
> got to a point that its probably a docker containerizer problem...)
> To sum it up:
> When i deploy simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get expected result - the task receives sigterm and 
> dies peacefully (during my script-specified 2 seconds period)
> But when i wrap this python script in a docker:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run appropriate application by Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: yes
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> The task during restart (issued from marathon) dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-01-13 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095831#comment-15095831
 ] 

Martin Bydzovsky commented on MESOS-4279:
-

Hi again,

So we finally managed to compile and deploy Mesos 0.27 - and I still can't get this working. :/

It's always the same output - restarting the app in Marathon results in 
{code}
...
Iteration #29
Killing docker task
Shutting down
{code}


We have dockerized both the mesos-master and the slave, so you can easily reproduce our setup 
like this:
{code:title=mesos-slave|borderStyle=solid}
docker run -it --privileged -p 5051:5051 -v 
/var/run/docker.sock:/var/run/docker.sock falsecz/mesos:git-468b8ec-with-docker 
mesos-slave --master=10.141.141.10:5050 --containerizers=mesos,docker 
--docker_stop_timeout=10secs --isolation=cgroups/cpu,cgroups/mem 
--advertise_ip=10.141.141.10 --no-switch_user --hostname=10.141.141.10
{code}

{code:title=mesos-master|borderStyle=solid}
docker run -it --privileged -p 5050:5050 falsecz/mesos:git-468b8ec mesos-master 
--work_dir=/tmp --advertise_ip=10.141.141.10 --hostname=10.141.141.10
{code}

{code:title=marathon|borderStyle=solid}
./start --master 10.141.141.10:5050 --zk zk://localhost:2181/marathon
{code}

No rocket science - the simplest setup possible, but I really don't know how 
your setup could differ from this one.


> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>
> I'm implementing a graceful restarts of our mesos-marathon-docker setup and I 
> came to a following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and guys form mesosphere 
> got to a point that its probably a docker containerizer problem...)
> To sum it up:
> When i deploy simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get expected result - the task receives sigterm and 
> dies peacefully (during my script-specified 2 seconds period)
> But when i wrap this python script in a docker:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run appropriate application by Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: yes
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> The task during restart (issued from marathon) dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-01-13 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095854#comment-15095854
 ] 

Martin Bydzovsky commented on MESOS-4279:
-

Well, that's what I have been doing the whole time until now. It's just that I ran the 0.27 build 
inside Docker - so you can easily run exactly the same example.

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>
> I'm implementing a graceful restarts of our mesos-marathon-docker setup and I 
> came to a following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and guys form mesosphere 
> got to a point that its probably a docker containerizer problem...)
> To sum it up:
> When i deploy simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get expected result - the task receives sigterm and 
> dies peacefully (during my script-specified 2 seconds period)
> But when i wrap this python script in a docker:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run appropriate application by Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: yes
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> The task during restart (issued from marathon) dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-01-13 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096389#comment-15096389
 ] 

Martin Bydzovsky commented on MESOS-4279:
-

Just simple: NO, it doesn't work. :)

Now I created a completely fresh Ubuntu 14.04 server VM:

I installed a VirtualBox machine from 
http://releases.ubuntu.com/14.04.3/ubuntu-14.04.3-server-amd64.iso

and then followed the step-by-step guide to compile and run Mesos from the current git master 
branch (http://mesos.apache.org/gettingstarted/):

I cloned the repo, did some apt-get installs (git, openjdk, build-essentials, python, libxxx and 
so on, as mentioned in the guide),
and then ran
{code}
$ cd mesos
$ ./bootstrap
$ mkdir build
$ cd build
$ ../configure
$ make
{code}

After everything compiled OK, I started the master and the slave:
{code}
./mesos-master.sh --work_dir=/tmp --advertise_ip=192.168.59.4 --hostname=192.168.59.4
{code}

{code}
./mesos-slave.sh --master=192.168.59.4:5050 --containerizers=mesos,docker 
--docker_stop_timeout=10secs --isolation=cgroups/cpu,cgroups/mem 
--advertise_ip=192.168.59.4 --no-switch_user --hostname=192.168.59.4
{code}

Then I started the "standalone" app - the previously mentioned Python script in /tmp/script.py - 
as follows:
{code:title=standalone.coffee}
request = require "request"

data =
  args: ["/tmp/script.py"]
  cpus: 0.1
  mem: 256
  instances: 1
  id: "python-standalone"

request
  method: "post"
  url: "http://localhost:8080/v2/apps"
  json: data
, (e, r, b) ->
  console.log "err", e if e
{code}
with graceful restarts working awesomely!

Then I created the python-docker app:
{code:title=docker.coffee}
request = require "request"

data =
  args: ["./script.py"]
  container:
    type: "DOCKER"
    docker:
      image: "bydga/marathon-test-api"
  cpus: 0.1
  mem: 256
  instances: 1
  id: "python-docker"

request
  method: "post"
  url: "http://localhost:8080/v2/apps"
  json: data
, (e, r, b) ->
  console.log "err", e if e
{code}

and obviously - it's *NOT working*.

You can see that the states of the two tasks differ - KILLED vs FINISHED. I would expect FINISHED 
in every case. http://prntscr.com/9pm28x

See the attached screenshot of the stdout output of both tasks: 
http://prntscr.com/9pm1ny

Also attached are the complete logs of the master, slave and marathon runs: 
https://gist.github.com/bydga/f8e907d4c59bbcab726e
I'm getting really desperate now - can someone confirm that they are able to reproduce the above?



> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>
> I'm implementing a graceful restarts of our mesos-marathon-docker setup and I 
> came to a following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and guys form mesosphere 
> got to a point that its probably a docker containerizer problem...)
> To sum it up:
> When i deploy simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get expected result - the task receives sigterm and 
> dies peacefully (during my script-specified 2 seconds period)
> But when i wrap this python script in a docker:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run appropriate application by Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: yes
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: 

[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-01-11 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15091797#comment-15091797
 ] 

Martin Bydzovsky commented on MESOS-4279:
-

Well, now I did {{tail -f /var/log/upstart/docker.log}} and this is the output during the app 
restart:
{code}
INFO[0365] GET /v1.21/containers/bydga/marathon-test-api:latest/json
ERRO[0365] Handler for GET 
/v1.21/containers/bydga/marathon-test-api:latest/json returned error: no such 
id: bydga/marathon-test-api:latest
ERRO[0365] HTTP Errorerr=no such id: 
bydga/marathon-test-api:latest statusCode=404
INFO[0365] GET /v1.21/images/bydga/marathon-test-api:latest/json
INFO[0366] POST 
/v1.21/containers/create?name=mesos-35e27fef-76b9-43f5-921d-83574ded0405-S0.80d8dd5e-d2e1-4379-aa27-1672f6fb8c6e
WARN[0366] Your kernel does not support swap limit capabilities, memory limited 
without swap.
INFO[0366] GET 
/v1.21/containers/mesos-35e27fef-76b9-43f5-921d-83574ded0405-S0.80d8dd5e-d2e1-4379-aa27-1672f6fb8c6e/json
INFO[0366] POST 
/v1.21/containers/70966cf9826b8e6cb14f60e4b82940786f226be717e2ab4136289117b571a178/attach?stderr=1=1=1
INFO[0366] POST 
/v1.21/containers/70966cf9826b8e6cb14f60e4b82940786f226be717e2ab4136289117b571a178/start
INFO[0366] GET 
/v1.21/containers/mesos-35e27fef-76b9-43f5-921d-83574ded0405-S0.80d8dd5e-d2e1-4379-aa27-1672f6fb8c6e/json
INFO[0366] POST 
/v1.21/containers/mesos-35e27fef-76b9-43f5-921d-83574ded0405-S0.f83fa298-8ccb-4f2d-a215-1dbd0ed70789/stop?t=10
ERRO[0366] attach: stdout: write unix @: broken pipe
INFO[0369] GET 
/v1.21/containers/mesos-35e27fef-76b9-43f5-921d-83574ded0405-S0.f83fa298-8ccb-4f2d-a215-1dbd0ed70789/json
{code}



Maybe it's some Docker-related problem..

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>
> I'm implementing a graceful restarts of our mesos-marathon-docker setup and I 
> came to a following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and guys form mesosphere 
> got to a point that its probably a docker containerizer problem...)
> To sum it up:
> When i deploy simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get expected result - the task receives sigterm and 
> dies peacefully (during my script-specified 2 seconds period)
> But when i wrap this python script in a docker:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run appropriate application by Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: yes
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> The task during restart (issued from marathon) dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-01-11 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15091776#comment-15091776
 ] 

Martin Bydzovsky commented on MESOS-4279:
-

Thanks [~qianzhang] for your time. Well, this is the moment when I start to feel bad: exactly 
this config is not working for me.. :/ I've just started a simple local mesos-marathon cluster 
with the following config:
mesos-slave http://10.141.141.10:5051/state.json
{code:javascript}
{
attributes: {},
build_date: "2015-10-12 20:57:28",
build_time: 1444683448,
build_user: "root",
completed_frameworks: [],
flags: {
appc_store_dir: "/tmp/mesos/store/appc",
authenticatee: "crammd5",
cgroups_cpu_enable_pids_and_tids_count: "false",
cgroups_enable_cfs: "false",
cgroups_hierarchy: "/sys/fs/cgroup",
cgroups_limit_swap: "false",
cgroups_root: "mesos",
container_disk_watch_interval: "15secs",
containerizers: "docker,mesos",
default_role: "*",
disk_watch_interval: "1mins",
docker: "docker",
docker_kill_orphans: "true",
docker_remove_delay: "6hrs",
docker_socket: "/var/run/docker.sock",
docker_stop_timeout: "10secs",
enforce_container_disk_quota: "false",
executor_registration_timeout: "5mins",
executor_shutdown_grace_period: "5secs",
fetcher_cache_dir: "/tmp/mesos/fetch",
fetcher_cache_size: "2GB",
frameworks_home: "",
gc_delay: "1weeks",
gc_disk_headroom: "0.1",
hadoop_home: "",
help: "false",
hostname: "10.141.141.10",
hostname_lookup: "true",
image_provisioner_backend: "copy",
initialize_driver_logging: "true",
isolation: "posix/cpu,posix/mem",
launcher_dir: "/usr/libexec/mesos",
log_dir: "/var/log/mesos",
logbufsecs: "0",
logging_level: "INFO",
master: "zk://localhost:2181/mesos",
oversubscribed_resources_interval: "15secs",
perf_duration: "10secs",
perf_interval: "1mins",
port: "5051",
qos_correction_interval_min: "0ns",
quiet: "false",
recover: "reconnect",
recovery_timeout: "15mins",
registration_backoff_factor: "1secs",
resource_monitoring_interval: "1secs",
revocable_cpu_low_priority: "true",
sandbox_directory: "/mnt/mesos/sandbox",
strict: "true",
switch_user: "true",
systemd_runtime_directory: "/run/systemd/system",
version: "false",
work_dir: "/tmp/mesos"
},
git_sha: "2dd7f7ee115fe00b8e098b0a10762a4fa8f4600f",
git_tag: "0.25.0",
hostname: "10.141.141.10",
id: "35e27fef-76b9-43f5-921d-83574ded0405-S0",
log_dir: "/var/log/mesos",
master_hostname: "mesos.vm",
pid: "slave(1)@127.0.1.1:5051",
resources: {
cpus: 2,
disk: 34068,
mem: 1000,
ports: "[31000-32000]"
},
start_time: 1452510028.844,
version: "0.25.0"
}
{code}

mesos-master http://10.141.141.10:5050/state.json
{code:javascript}
{
  activated_slaves: 1,
  build_date: "2015-10-12 20:57:28",
  build_time: 1444683448,
  build_user: "root",
  completed_frameworks: [],
  deactivated_slaves: 0,
  elected_time: 1452509876.02982,
  flags: {
    allocation_interval: "1secs",
    allocator: "HierarchicalDRF",
    authenticate: "false",
    authenticate_slaves: "false",
    authenticators: "crammd5",
    authorizers: "local",
    framework_sorter: "drf",
    help: "false",
    hostname_lookup: "true",
    initialize_driver_logging: "true",
    log_auto_initialize: "true",
    log_dir: "/var/log/mesos",
    logbufsecs: "0",
    logging_level: "INFO",
    max_slave_ping_timeouts: "5",
    port: "5050",
    quiet: "false",
    quorum: "1",
    recovery_slave_removal_limit: "100%",
    registry: "replicated_log",
    registry_fetch_timeout: "1mins",
    registry_store_timeout: "5secs",
    registry_strict: "false",
    root_submissions: "true",

[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-01-11 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15091884#comment-15091884
 ] 

Martin Bydzovsky commented on MESOS-4279:
-

Hmm, you are using the master branch (0.27 WIP) - we would need to compile it 
ourselves to check whether it works :) I've been trying it on 0.24.1, 0.25 and 
0.26 - the versions available in the distribution repositories.

And another question - you use {{"containerizers": "mesos,docker"}} - does that 
mean the mesos containerizer has higher priority and thus gets used? What's the 
advantage of this approach over preferring the docker containerizer?

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I 
> came to the following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and the guys from mesosphere 
> got to the point that it's probably a docker containerizer problem...)
> To sum it up:
> When I deploy a simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result - the task receives SIGTERM and 
> dies peacefully (within my script-specified 2-second grace period).
> But when I wrap this python script in a Docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the appropriate application via Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>     type: "DOCKER",
>     docker: {
>       image: "bydga/marathon-test-api"
>     },
>     forcePullImage: true
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> The task during restart (issued from marathon) dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-01-08 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089364#comment-15089364
 ] 

Martin Bydzovsky commented on MESOS-4279:
-

Well, I guess you introduced another "issue" in your test example. It's related 
to the way you started the Marathon app. Please look at the explanation here: 
https://mesosphere.github.io/marathon/docs/native-docker.html#command-vs-args. 
In your {{ps}} output, you can see that the actual command is {{/bin/sh -c 
python /app/script.py}} - wrapped by {{sh -c}}.

Seems like you started your Marathon app with something like: 
{code}curl -XPOST http://marathon:8080/v2/apps --data={id: "test-app", cmd: 
"python script.py", ...} {code}
What I was showing in my examples above was: 
{code}curl -XPOST http://marathon:8080/v2/apps --data={id: "test-app", args: 
["/tmp/script.py"], ...} {code}

Usually this is called the "PID 1 problem" - 
https://medium.com/@gchudnov/trapping-signals-in-docker-containers-7a57fdda7d86#.zcxhq8yqn.

Simply put, in your example PID 1 inside the docker container is the shell 
process and the actual python script is PID 2. The default signal handling for 
all processes EXCEPT PID 1 is to shut down on SIGINT/SIGTERM, while PID 1 by 
default just ignores them.
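
If wrapping really is unavoidable, a tiny init-style wrapper that forwards the 
signals to the child is one way around it. The sketch below is only an 
illustration (the wrapper and its name are hypothetical, not something mesos or 
marathon ships):
{code}
#!/usr/bin/python
# Hypothetical wrapper: run as PID 1 inside the container and forward
# SIGTERM/SIGINT to the real app, so the app still gets its chance to clean up
# even though it is not PID 1 itself.
import signal
import subprocess
import sys

child = subprocess.Popen(sys.argv[1:])   # e.g. ["./script.py"]

def forward(signo, _frame):
    child.send_signal(signo)             # pass the signal on to the child

signal.signal(signal.SIGTERM, forward)
signal.signal(signal.SIGINT, forward)

sys.exit(child.wait())                   # exit with the child's exit status
{code}
With an exec-form {{ENTRYPOINT}} pointing at such a wrapper, the args still 
reach the application while signal delivery stays intact.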

So you could retry the example and use args instead of cmd. Then your {{ps}} 
output should look like:
{code}
root 10738  0.0  0.0 218228 14236 ? 15:22   0:00 docker run -c 102 -m 
268435456 -e PORT_10002=31123 -e MARATHON_APP_VERSION=2016-01-08T15:22:49.646Z 
-e HOST=mesos-slave1.example.com -e 
MARATHON_APP_DOCKER_IMAGE=bydga/marathon-test-api -e 
MESOS_TASK_ID=marathon-test-api.ad9cbac5-b61b-11e5-af54-023bd987a59b -e 
PORT=31123 -e PORTS=31123 -e MARATHON_APP_ID=/marathon-test-api -e PORT0=31123 
-e MESOS_SANDBOX=/mnt/mesos/sandbox -v 
/srv/mesos/slaves/20160106-114735-3423223818-5050-1508-S3/frameworks/20160106-083626-1258962954-5050-9311-/executors/marathon-test-api.ad9cbac5-b61b-11e5-af54-023bd987a59b/runs/bbeb80ab-e8d0-4b93-b7a0-6475787e090f:/mnt/mesos/sandbox
 --net host --name 
mesos-20160106-114735-3423223818-5050-1508-S3.bbeb80ab-e8d0-4b93-b7a0-6475787e090f
 bydga/marathon-test-api ./script.py
root 10749  0.0  0.0  21576  4336 ? 15:22   0:00 /usr/bin/python 
./script.py
{code}

With this setup, the docker stop works as expected:
{code}
bydzovskym mesos-slave1:aws ~   docker ps
CONTAINER ID   IMAGE                     COMMAND         CREATED         STATUS         PORTS   NAMES
ed4a35e4372c   bydga/marathon-test-api   "./script.py"   7 minutes ago   Up 7 minutes           mesos-20160106-114735-3423223818-5050-1508-S3.bbeb80ab-e8d0-4b93-b7a0-6475787e090f

bydzovskym mesos-slave1:aws ~   time docker stop ed4a35e4372c
ed4a35e4372c

real0m2.184s
user0m0.016s
sys 0m0.042s
{code}
and the output of the docker container:
{code}
bydzovskym mesos-slave1:aws ~   docker logs -f ed4a35e4372c
Hello
15:15:57.943294
Iteration #1
15:15:58.944470
Iteration #2
15:15:59.945631
Iteration #3
15:16:00.946794
got 15
15:16:40.473517
15:16:42.475655
ending
Goodbye
{code}

The docker stop took a little more than 2 seconds - matching the grace period in 
the python script.

I still guess the problem is somewhere in how mesos orchestrates docker - 
either it sends a wrong {{docker kill}} or it kills the task even more painfully 
(e.g. killing the {{docker run}} process with the linux {{kill}} command).
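
A quick, hand-rolled way to tell a graceful SIGTERM from a hard kill (the 
container name below is a placeholder, not one from this setup) is to time the 
stop and look for the script's cleanup message in the logs:
{code}
#!/usr/bin/python
# Sketch: time `docker stop` and check whether the cleanup message made it into
# the logs. CONTAINER is a placeholder name.
import subprocess
import time

CONTAINER = "mesos-test-container"

start = time.time()
subprocess.check_call(["docker", "stop", "-t", "10", CONTAINER])
print "docker stop took %.1fs" % (time.time() - start)

logs = subprocess.check_output(["docker", "logs", CONTAINER])
if "Goodbye" in logs:
    print "graceful shutdown"
else:
    print "killed hard (no cleanup output)"
{code}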

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I 
> came to the following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and the guys from mesosphere 
> got to the point that it's probably a docker containerizer problem...)
> To sum it up:
> When I deploy a simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
> 

[jira] [Updated] (MESOS-4279) Graceful restart of docker task

2016-01-04 Thread Martin Bydzovsky (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Bydzovsky updated MESOS-4279:

Description: 
I'm implementing graceful restarts of our mesos-marathon-docker setup and I 
came to the following issue:

(it was already discussed on https://github.com/mesosphere/marathon/issues/2876 
and the guys from mesosphere got to the point that it's probably a docker 
containerizer problem...)
To sum it up:

When I deploy a simple python script to all mesos-slaves:
{code}
#!/usr/bin/python

from time import sleep
import signal
import sys
import datetime

def sigterm_handler(_signo, _stack_frame):
    print "got %i" % _signo
    print datetime.datetime.now().time()
    sys.stdout.flush()
    sleep(2)
    print datetime.datetime.now().time()
    print "ending"
    sys.stdout.flush()
    sys.exit(0)

signal.signal(signal.SIGTERM, sigterm_handler)
signal.signal(signal.SIGINT, sigterm_handler)

try:
    print "Hello"
    i = 0
    while True:
        i += 1
        print datetime.datetime.now().time()
        print "Iteration #%i" % i
        sys.stdout.flush()
        sleep(1)
finally:
    print "Goodbye"
{code}

and I run it through Marathon like
{code:javascript}
data = {
  args: ["/tmp/script.py"],
  instances: 1,
  cpus: 0.1,
  mem: 256,
  id: "marathon-test-api"
}
{code}

During the app restart I get the expected result - the task receives SIGTERM and 
dies peacefully (within my script-specified 2-second grace period).

But when I wrap this python script in a Docker image:
{code}
FROM node:4.2

RUN mkdir /app
ADD . /app
WORKDIR /app
ENTRYPOINT []
{code}
and run the appropriate application via Marathon:
{code:javascript}
data = {
  args: ["./script.py"],
  container: {
    type: "DOCKER",
    docker: {
      image: "bydga/marathon-test-api"
    },
    forcePullImage: true
  },
  cpus: 0.1,
  mem: 256,
  instances: 1,
  id: "marathon-test-api"
}
{code}

The task during restart (issued from marathon) dies immediately without having 
a chance to do any cleanup.


  was:
I'm implementing a graceful restarts of our mesos-marathon-docker setup and I 
came to a following issue:

(it was already discussed on https://github.com/mesosphere/marathon/issues/2876 
and guys form mesosphere got to a point that its probably docker containerizer 
problem...)
To sum it up:

When i deploy simple python script to all mesos-slaves:
{code}
#!/usr/bin/python

from time import sleep
import signal
import sys
import datetime

def sigterm_handler(_signo, _stack_frame):
    print "got %i" % _signo
    print datetime.datetime.now().time()
    sys.stdout.flush()
    sleep(2)
    print datetime.datetime.now().time()
    print "ending"
    sys.stdout.flush()
    sys.exit(0)

signal.signal(signal.SIGTERM, sigterm_handler)
signal.signal(signal.SIGINT, sigterm_handler)

try:
    print "Hello"
    i = 0
    while True:
        i += 1
        print datetime.datetime.now().time()
        print "Iteration #%i" % i
        sys.stdout.flush()
        sleep(1)
finally:
    print "Goodbye"
{code}

and I run it through Marathon like
{code:javascript}
data = {
args: ["/tmp/script.py"],
instances: 1,
cpus: 0.1,
mem: 256,
id: "marathon-test-api"
}
{code}

During app restart I get expected result - task receives sigterm and dies 
peacefully (during my script-specified 2 seconds)

But when i wrap this python script in docker:
{code}
FROM node:4.2

RUN mkdir /app
ADD . /app
WORKDIR /app
ENTRYPOINT []
{code}
and run appropriate application by Marathon:
{code:javascript}
data = {
args: ["./script.py"],
container: {
type: "DOCKER",
docker: {
image: "bydga/marathon-test-api"
},
forcePullImage: yes
},
cpus: 0.1,
mem: 256,
instances: 1,
id: "marathon-test-api"
}
{code}

The task during restart (issued from marathon) dies immediately without a 
chance to do any cleanup.



> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
>Reporter: Martin Bydzovsky
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I 
> came to the following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and the guys from mesosphere 
> got to the point that it's probably a docker containerizer problem...)
> To sum it up:
> When I deploy a simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import 

[jira] [Created] (MESOS-4279) Graceful restart of docker task

2016-01-04 Thread Martin Bydzovsky (JIRA)
Martin Bydzovsky created MESOS-4279:
---

 Summary: Graceful restart of docker task
 Key: MESOS-4279
 URL: https://issues.apache.org/jira/browse/MESOS-4279
 Project: Mesos
  Issue Type: Bug
  Components: containerization, docker
Affects Versions: 0.25.0
Reporter: Martin Bydzovsky


I'm implementing graceful restarts of our mesos-marathon-docker setup and I 
came to the following issue:

(it was already discussed on https://github.com/mesosphere/marathon/issues/2876 
and the guys from mesosphere got to the point that it's probably a docker 
containerizer problem...)
To sum it up:

When I deploy a simple python script to all mesos-slaves:
{code}
#!/usr/bin/python

from time import sleep
import signal
import sys
import datetime

def sigterm_handler(_signo, _stack_frame):
    print "got %i" % _signo
    print datetime.datetime.now().time()
    sys.stdout.flush()
    sleep(2)
    print datetime.datetime.now().time()
    print "ending"
    sys.stdout.flush()
    sys.exit(0)

signal.signal(signal.SIGTERM, sigterm_handler)
signal.signal(signal.SIGINT, sigterm_handler)

try:
    print "Hello"
    i = 0
    while True:
        i += 1
        print datetime.datetime.now().time()
        print "Iteration #%i" % i
        sys.stdout.flush()
        sleep(1)
finally:
    print "Goodbye"
{code}

and I run it through Marathon like
{code:javascript}
data = {
  args: ["/tmp/script.py"],
  instances: 1,
  cpus: 0.1,
  mem: 256,
  id: "marathon-test-api"
}
{code}

During the app restart I get the expected result - the task receives SIGTERM and 
dies peacefully (within my script-specified 2-second grace period).

But when I wrap this python script in a Docker image:
{code}
FROM node:4.2

RUN mkdir /app
ADD . /app
WORKDIR /app
ENTRYPOINT []
{code}
and run the appropriate application via Marathon:
{code:javascript}
data = {
  args: ["./script.py"],
  container: {
    type: "DOCKER",
    docker: {
      image: "bydga/marathon-test-api"
    },
    forcePullImage: true
  },
  cpus: 0.1,
  mem: 256,
  instances: 1,
  id: "marathon-test-api"
}
{code}

The task during restart (issued from marathon) dies immediately without a 
chance to do any cleanup.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)