[jira] [Created] (MESOS-9207) CFS on docker executor tasks doesn't work
Martin Bydzovsky created MESOS-9207:
---------------------------------------

             Summary: CFS on docker executor tasks doesn't work
                 Key: MESOS-9207
                 URL: https://issues.apache.org/jira/browse/MESOS-9207
             Project: Mesos
          Issue Type: Bug
          Components: containerization, docker
    Affects Versions: 1.6.1, 1.5.1
            Reporter: Martin Bydzovsky


CFS hard-limiting on docker-based tasks doesn't work. The --cgroups-enable-cfs support added in [https://github.com/apache/mesos/commit/346cc8dd528a28a6e1f1cbdb4c95b8bdea2f6070] adds the --cpu-quota parameter, which on its own is completely ineffective. Hard-limiting must be activated by setting either --cpus, or --cpu-period together with (optionally overriding the default) --cpu-quota (https://docs.docker.com/config/containers/resource_constraints/#configure-the-default-cfs-scheduler).

Attaching output showing the wrong parameters added by the executor:

{code:java}
bydga@bydzovskym ~ λ curl http://mesos-slave1:5051/flags | jshon | grep cfs
 "cgroups_enable_cfs": "true",
bydga@bydzovskym ~ λ ssh mesos-slave1
Welcome to Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-1060-aws x86_64)
bydzovskym mesos-slave1:us-w2 ~ ps aux | grep example-api
root 30414 0.1 0.3 843532 49296 ? Ssl 07:54 0:01 mesos-docker-executor --cgroups_enable_cfs=true --container=mesos-6e31d2cb-ac4f-4b1c-ae2b-08cf54acc088 --docker=docker --docker_socket=/var/run/docker.sock --help=false --initialize_driver_logging=true --launcher_dir=/usr/libexec/mesos --logbufsecs=0 --logging_level=INFO --mapped_directory=/mnt/mesos/sandbox --quiet=false --sandbox_directory=/srv/mesos/slaves/6b8f88fb-29df-4a35-86c3-a369d1447a53-S0/frameworks/2da5f61c-8400-40e0-8964-3edbd2f24e37-0001/executors/hera_example-api_production_api.b4ff812e-b017-11e8-92cc-06cd01d45cce/runs/6e31d2cb-ac4f-4b1c-ae2b-08cf54acc088 --stop_timeout=30secs
root 30426 0.0 0.1 324744 26644 ? Sl 07:54 0:00 docker -H unix:///var/run/docker.sock run --cpu-shares 1024 --cpu-quota 10 --memory 209715200 -e HOST=mesos-slave1.priv -e MARATHON_APP_DOCKER_IMAGE=awsid.dkr.ecr.us-west-2.amazonaws.com/hera/example-api/production:d07dd097 -e MARATHON_APP_ID=/hera/example-api/production/api -e MARATHON_APP_RESOURCE_CPUS=1.0 -e MARATHON_APP_RESOURCE_DISK=0.0 -e MARATHON_APP_RESOURCE_GPUS=0 -e MARATHON_APP_RESOURCE_MEM=200.0 -e MARATHON_APP_VERSION=2018-09-04T07:54:09.419Z -e MESOS_CONTAINER_NAME=mesos-6e31d2cb-ac4f-4b1c-ae2b-08cf54acc088 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e MESOS_TASK_ID=hera_example-api_production_api.b4ff812e-b017-11e8-92cc-06cd01d45cce -e PORT=9115 -e PORT0=9115 -e PORTS=9115 -e PORT_9115=9115 -e PORT_PORT0=9115 -v /srv/mesos/slaves/6b8f88fb-29df-4a35-86c3-a369d1447a53-S0/frameworks/2da5f61c-8400-40e0-8964-3edbd2f24e37-0001/executors/hera_example-api_production_api.b4ff812e-b017-11e8-92cc-06cd01d45cce/runs/6e31d2cb-ac4f-4b1c-ae2b-08cf54acc088:/mnt/mesos/sandbox --net bridge -p 9115:9115/tcp --name mesos-6e31d2cb-ac4f-4b1c-ae2b-08cf54acc088 --label=MESOS_TASK_ID=hera_example-api_production_api.b4ff812e-b017-11e8-92cc-06cd01d45cce awsid.dkr.ecr.us-west-2.amazonaws.com/hera/example-api/production:d07dd097 coffee index.coffee
{code}

You can see that the mesos-docker-executor has correctly propagated
{code:java}
--cgroups_enable_cfs=true
{code}
However, only
{code:java}
--cpu-shares 1024 --cpu-quota 10
{code}
are set in the docker run command.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
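For illustration, a minimal Python sketch (not Mesos source) of how docker's CFS flags have to be combined for a fractional CPU allocation to take effect; the constants are docker's default 100ms period and the kernel's 1ms minimum quota:

{code:python}
# Illustrative sketch only (not the Mesos executor code): how docker's
# CFS flags relate. CFS_PERIOD_US is docker's default --cpu-period;
# the kernel rejects quotas below 1000us.
CFS_PERIOD_US = 100000
MIN_QUOTA_US = 1000

def cfs_flags(cpus):
    """Translate a fractional CPU allocation into docker run flags."""
    quota = max(int(cpus * CFS_PERIOD_US), MIN_QUOTA_US)
    return ["--cpu-period", str(CFS_PERIOD_US), "--cpu-quota", str(quota)]

print(" ".join(cfs_flags(1.0)))  # --cpu-period 100000 --cpu-quota 100000
print(" ".join(cfs_flags(0.1)))  # --cpu-period 100000 --cpu-quota 10000
{code}

Under these defaults, the 1.0-CPU task above would need --cpu-quota 100000, not the bare "--cpu-quota 10" visible in the ps output.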
[jira] [Created] (MESOS-8982) add cgroup memory.max_usage_in_bytes into slave monitor/statistics endpoint
Martin Bydzovsky created MESOS-8982:
---------------------------------------

             Summary: add cgroup memory.max_usage_in_bytes into slave monitor/statistics endpoint
                 Key: MESOS-8982
                 URL: https://issues.apache.org/jira/browse/MESOS-8982
             Project: Mesos
          Issue Type: Improvement
          Components: cgroups, docker, HTTP API
    Affects Versions: 1.6.0
            Reporter: Martin Bydzovsky


As an operator, I'm periodically checking the slave's monitor/statistics endpoint to get the memory/CPU usage and CPU throttling for each running task. However, if there is a short-term memory usage peak (lasting, say, seconds), I might miss it (the memory might have been allocated and also released between two of my metric-collection intervals). Since the maximum used memory is recorded in `/sys/fs/cgroup/memory/docker/CID/memory.max_usage_in_bytes`, it would be great if this information were exposed in the API as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
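As a stopgap until such an endpoint exists, an operator-side collector can read the watermark directly. A minimal sketch, assuming the cgroup v1 layout from the description (the path may differ per cgroup mount) and relying on the cgroup v1 behavior that writing 0 to the file resets the peak:

{code:python}
# Sketch: sample (and optionally reset) the cgroup v1 peak-memory
# watermark for a docker container. Path layout assumed as in the
# description above; adjust to your cgroup mount.
PATH = "/sys/fs/cgroup/memory/docker/{cid}/memory.max_usage_in_bytes"

def peak_memory_bytes(container_id, reset=False):
    path = PATH.format(cid=container_id)
    with open(path) as f:
        peak = int(f.read())
    if reset:
        # Writing 0 resets the watermark, so the next sample reflects
        # only the peak since this collection interval.
        with open(path, "w") as f:
            f.write("0")
    return peak
{code}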
[jira] [Commented] (MESOS-7522) Mesos containerizer to support docker credential helpers for private docker registries
[ https://issues.apache.org/jira/browse/MESOS-7522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226466#comment-16226466 ]

Martin Bydzovsky commented on MESOS-7522:
-----------------------------------------

+1 for this. Specifying credentials for pulling an image as a static `credential principal+secret` in the Mesos containerizer is a no-go for AWS ECR. They issue you a token (by running `aws ecr get-login`) which is valid for something like 12 hours, and then you need to obtain a new token.. Or is there a workaround for this?

> Mesos containerizer to support docker credential helpers for private docker
> registries
> ---------------------------------------------------------------------------
>
>                 Key: MESOS-7522
>                 URL: https://issues.apache.org/jira/browse/MESOS-7522
>             Project: Mesos
>          Issue Type: Wish
>          Components: containerization
>            Reporter: Mao Geng
>            Assignee: Mao Geng
>              Labels: mesos-containerizer
>
> At Pinterest, we use Amazon ECR as our docker registry and use
> https://github.com/awslabs/amazon-ecr-credential-helper to let the docker engine
> get an auth token automatically.
> It works well with the docker containerizer, as long as I have
> .docker/config.json configured with "credStores" and --docker_config configured
> for the mesos-agent.
> However, this doesn't work for the mesos containerizer. Meanwhile, we want to use
> the mesos containerizer's GPU support, so we have to run a separate docker
> registry over http and without auth, purely for the mesos containerizer.
> I think it would be good if the mesos containerizer supported
> https://github.com/docker/docker-credential-helpers by default, so that it
> addresses a pain point for users who are using credential helpers
> with private registries on ECR, GCR, quay, dockerhub etc.
> This might be related to MESOS-7088
> CC [~jieyu] [~gilbert]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
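For context, this is roughly what the token dance looks like programmatically — a sketch using boto3 (the region is an assumed example, not part of the ticket), showing why a fixed principal/secret cannot model it:

{code:python}
# Sketch: fetching a fresh ECR auth token, which a static
# principal+secret credential in Mesos cannot represent.
import base64
import boto3

ecr = boto3.client("ecr", region_name="us-west-2")  # region assumed
auth = ecr.get_authorization_token()["authorizationData"][0]

# The token is base64("AWS:<password>") and expires in roughly 12 hours.
user, password = base64.b64decode(auth["authorizationToken"]).decode().split(":", 1)
registry = auth["proxyEndpoint"]
expires_at = auth["expiresAt"]  # a new token must be fetched before this time
{code}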
[jira] [Commented] (MESOS-4279) Docker executor truncates task's output when the task is killed.
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15318411#comment-15318411 ]

Martin Bydzovsky commented on MESOS-4279:
------------------------------------------

Hi Ben, nice to hear that. Just to be sure: https://reviews.apache.org/r/46892/ - this is the RB issue. I wanted to help one last time with the things mentioned in the RB, but yesterday I pulled the Mesos sources and compiled them, and I'm unable to run Mesos 1.0.0 with Marathon 1.1.1, as I keep getting "Mesos JAR version 0.28.0 is not backwards compatible with Mesos native library version 1.0.0". So I guess now it's up to you. The solution is pointed out on my GitHub; maybe you just need to make it a bit nicer :)

> Docker executor truncates task's output when the task is killed.
> -----------------------------------------------------------------
>
>                 Key: MESOS-4279
>                 URL: https://issues.apache.org/jira/browse/MESOS-4279
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, docker
>    Affects Versions: 0.25.0, 0.26.0, 0.27.2, 0.28.1
>            Reporter: Martin Bydzovsky
>            Assignee: Benjamin Mahler
>            Priority: Critical
>              Labels: docker, mesosphere
>             Fix For: 1.0.0
>
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I
> came across the following issue:
> (it was already discussed on
> https://github.com/mesosphere/marathon/issues/2876 and the guys from Mesosphere
> got to the point that it's probably a docker containerizer problem...)
> To sum it up:
> When I deploy this simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
>
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
>
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
>
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>     args: ["/tmp/script.py"],
>     instances: 1,
>     cpus: 0.1,
>     mem: 256,
>     id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result - the task receives SIGTERM and
> dies peacefully (within my script-specified 2-second period).
> But when I wrap this python script in a docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the corresponding application via Marathon:
> {code:javascript}
> data = {
>     args: ["./script.py"],
>     container: {
>         type: "DOCKER",
>         docker: {
>             image: "bydga/marathon-test-api"
>         },
>         forcePullImage: yes
>     },
>     cpus: 0.1,
>     mem: 256,
>     instances: 1,
>     id: "marathon-test-api"
> }
> {code}
> the task dies immediately during a restart (issued from Marathon) without
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249489#comment-15249489 ]

Martin Bydzovsky edited comment on MESOS-4279 at 4/20/16 8:31 AM:
------------------------------------------------------------------

Fine, so I will prepare the RB issues. Should I make two separate ones? Or just one fixing both problems? Or just the one fixing the corrupted stdout/err streams? Because in the second one, there's the philosophical question of marking the task as KILLED vs FINISHED when it ends during the grace period (even though non-docker tasks end as FINISHED).. :) [~alexr] [~haosd...@gmail.com]?

was (Author: bydga):
Fine, so I will prepare the RB issues. Should I make two separate ones? Or just one fixing both problems? Or just the one fixing the corrupted stdout/err streams? Because in the second one, there's the philosophical question of marking the task as KILLED vs FINISHED when it end during the grace period (even though non-docker tasks end as FINISHED).. :) [~alexr] [~haosd...@gmail.com]?


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249489#comment-15249489 ]

Martin Bydzovsky commented on MESOS-4279:
------------------------------------------

Fine, so I will prepare the RB issues. Should I make two separate ones? Or just one fixing both problems? Or just the one fixing the corrupted stdout/err streams? Because in the second one, there's the philosophical question of marking the task as KILLED vs FINISHED when it end during the grace period (even though non-docker tasks end as FINISHED).. :) [~alexr] [~haosd...@gmail.com]?


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245784#comment-15245784 ]

Martin Bydzovsky commented on MESOS-4279:
------------------------------------------

It's usually not in my job description to make videos... :) But I knew I should have added the Mission Impossible theme song in the background to make it more dramatic and attractive!


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245684#comment-15245684 ]

Martin Bydzovsky commented on MESOS-4279:
------------------------------------------

Just to make it transparent: I've exchanged a few emails with Alex to clarify and summarize the issues. Today I confirmed both issues still persist on Mesos 0.29 + Marathon 1.1.1. Here's a video demonstrating both of them: https://www.youtube.com/watch?v=vDUA9_ASYW0. Btw, don't forget my GitHub - I've already fixed both problems there... ;)


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242587#comment-15242587 ]

Martin Bydzovsky commented on MESOS-4279:
------------------------------------------

I'm happy I'm not the only one noticing this. Well, to be honest, I've absolutely given up on solving this issue. I reported the bug here, I even resolved the issues (there were actually more problems) in my branch on GitHub, I spoke with some guy (from Mesos) on IRC, and I asked on Slack - all without any response. No one cares. In the meantime, they do things like https://issues.apache.org/jira/browse/MESOS-4909 - but with the same errors again... So right now we are considering writing our own {{Executor}}.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15186839#comment-15186839 ]

Martin Bydzovsky commented on MESOS-4279:
------------------------------------------

Hi again [~qianzhang]. So I finally managed to solve the issue. There are actually 2 bugs in docker/executor.cpp:

1) The first is related to my findings about the error {{attach: stdout: write unix @: broken pipe}}, which I mentioned 2 months ago in comment https://issues.apache.org/jira/browse/MESOS-4279?focusedCommentId=15091797&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15091797: You are calling the {{run->discard}} method (which closes the stderr/stdout streams) too early - during the "stopping period" the container can (and usually will) write something about the termination, but the stream is already invalid. That's why the docker daemon complains, and that's why we haven't seen the messages from my script (got 15... ending... Goodbye...). Here's a commit that fixes it: https://github.com/bydga/mesos/commit/73d1b3dc8605cf51163619c9e05c88666d926951

2) The second is about the actual wrong TASK_KILLED state: Basically, you are setting the {{killed=true}} flag unconditionally https://github.com/apache/mesos/blob/master/src/docker/executor.cpp#L227 - which is not always true. Plus, https://github.com/apache/mesos/blob/master/src/docker/executor.cpp#L300 this check is also wrong - a task is failed when "it's not ready"? So here is a commit that fixes it too: https://github.com/bydga/mesos/commit/acf79781c04ad9309083dc39131e2c8305331431

I'm not a C++ developer, so the code might not be the best - however, it finally works. So, what now: Do you want me to create one pull request that solves both? Two pull requests, one per issue I mentioned? They actually touch the same code lines, so 2 separate merge requests might lead to conflicts... Can you process this? I don't want to create a new issue and go through the same months-long procedure again. Btw, I have created the commits against the 0.26.0 tag - because we were talking about this version. Should I rebase them onto master? Is there a chance they will get merged into the current WIP 0.28? Thanks for quick answers.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
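To make the second bug concrete, here is a Python rendering of the terminal-state logic the comment describes (illustrative pseudocode mirroring the described C++ flow, not the actual mesos-docker-executor API):

{code:python}
# Sketch of the TASK_KILLED vs TASK_FINISHED decision described above.
TASK_FINISHED, TASK_KILLED, TASK_FAILED = "FINISHED", "KILLED", "FAILED"

class ExecutorState(object):
    def __init__(self):
        self.kill_requested = False  # set when killTask() runs docker stop

    def reaped(self, exit_code):
        # Buggy logic: report TASK_KILLED whenever kill_requested is set,
        # even if the container shut down cleanly within the grace period.
        # The described fix keys off the exit code first:
        if exit_code == 0:
            return TASK_FINISHED  # clean exit, even during a kill
        if self.kill_requested:
            return TASK_KILLED    # terminated by the stop/kill
        return TASK_FAILED        # died on its own with a nonzero code
{code}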
[jira] [Commented] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183205#comment-15183205 ]

Martin Bydzovsky commented on MESOS-4279:
------------------------------------------

Are you sure, [~qianzhang], that you tried exactly {{vagrant up}} and then restarted the app (via the Marathon API/UI)? Because now I've started digging, adding custom logs in the Mesos codebase, and recompiling it over and over. And to me, the code looks like it has never worked. https://github.com/apache/mesos/blob/0.26.0/src/docker/executor.cpp#L219 - Immediately after calling docker->stop (with the correct value, btw, as I've inspected), you set {{killed=true}}, and then in the {{reaped}} method (which gets called immediately), you check the {{killed}} flag and send a wrong TASK_KILLED status update: https://github.com/apache/mesos/blob/0.26.0/src/docker/executor.cpp#L281. Finally, https://github.com/apache/mesos/blob/0.26.0/src/docker/executor.cpp#L308 stops the whole driver - I'm not sure yet what that really means, but if that's the parent process of the docker executor, then it will kill the {{docker run}} process in a cascade.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144279#comment-15144279 ]

Martin Bydzovsky commented on MESOS-4279:
------------------------------------------

Hello [~qianzhang], did you have time to check on the Vagrantfile?


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128200#comment-15128200 ]

Martin Bydzovsky commented on MESOS-4279:
------------------------------------------

Hi Qian again, sorry, I was busy doing other things, so I only now finally got back to this. I created a Vagrantfile to reproduce the issue: https://gist.github.com/bydga/34df92f67ae03ca2ec4d. All you need to do is run {{vagrant up}} in a folder with this file. Then you can go for a coffee (or maybe ten coffees), as the mesos compilation takes ages.. After that you should end up with a running VM with ZooKeeper, mesos-master, mesos-slave, and Marathon running, and also with 2 apps running in Mesos (standalone and dockerized). Then you can navigate to http://192.168.50.4:8080 and restart the apps (via the Marathon UI). The standalone app dies peacefully while the dockerized one gets killed violently.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095831#comment-15095831 ]

Martin Bydzovsky commented on MESOS-4279:
------------------------------------------

Hi again,
So we finally managed to compile and deploy Mesos 0.27 - and I still can't get this working. :/ It's always the same output - restarting the app in Marathon results in
{code}
...
Iteration #29
Killing docker task
Shutting down
{code}
We have dockerized both the mesos-master and the slave, so you can easily reproduce our setup like:
{code:title=mesos-slave|borderStyle=solid}
docker run -it --privileged -p 5051:5051 -v /var/run/docker.sock:/var/run/docker.sock falsecz/mesos:git-468b8ec-with-docker mesos-slave --master=10.141.141.10:5050 --containerizers=mesos,docker --docker_stop_timeout=10secs --isolation=cgroups/cpu,cgroups/mem --advertise_ip=10.141.141.10 --no-switch_user --hostname=10.141.141.10
{code}
{code:title=mesos-master|borderStyle=solid}
docker run -it --privileged -p 5050:5050 falsecz/mesos:git-468b8ec mesos-master --work_dir=/tmp --advertise_ip=10.141.141.10 --hostname=10.141.141.10
{code}
{code:title=marathon|borderStyle=solid}
./start --master 10.141.141.10:5050 --zk zk://localhost:2181/marathon
{code}
No rocket science - the simplest setup possible, but I really don't know how your setup could differ from this one.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095854#comment-15095854 ]

Martin Bydzovsky commented on MESOS-4279:
------------------------------------------

Well, that's what I was doing the whole time until now. It's just the 0.27 that I ran inside docker - so you can easily run exactly the same example.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096389#comment-15096389 ]

Martin Bydzovsky commented on MESOS-4279:
------------------------------------------

Quite simply: NO, it doesn't work. :)
Now I created a completely fresh Ubuntu 14.04 server VM in VirtualBox, installed from http://releases.ubuntu.com/14.04.3/ubuntu-14.04.3-server-amd64.iso. Then I followed the step-by-step guide to compile and run Mesos from the current git master branch (http://mesos.apache.org/gettingstarted/): cloned the repo, did some apt-get installs - git, openjdk, build-essentials, python, libxxx, blabla... (as mentioned in the guide). Then I did
{code}
$ cd mesos
$ ./bootstrap
$ mkdir build
$ cd build
$ ../configure
$ make
{code}
After everything compiled OK, I started the master and the slave:
{code}
./mesos-master.sh --work_dir=/tmp --advertise_ip=192.168.59.4 --hostname=192.168.59.4
{code}
{code}
./mesos-slave.sh --master=192.168.59.4:5050 --containerizers=mesos,docker --docker_stop_timeout=10secs --isolation=cgroups/cpu,cgroups/mem --advertise_ip=192.168.59.4 --no-switch_user --hostname=192.168.59.4
{code}
Then I started the "standalone" app - the previously mentioned python script in /tmp/script.py - as
{code:title=standalone.coffee}
request = require "request"

data =
  args: ["/tmp/script.py"]
  cpus: 0.1
  mem: 256
  instances: 1
  id: "python-standalone"

request
  method: "post"
  url: "http://localhost:8080/v2/apps"
  json: data
, (e, r, b) ->
  console.log "err", e if e
{code}
with awesomely working graceful restarts! Then I created the python-docker app:
{code:title=docker.coffee}
request = require "request"

data =
  args: ["./script.py"]
  container:
    type: "DOCKER"
    docker:
      image: "bydga/marathon-test-api"
  cpus: 0.1
  mem: 256
  instances: 1
  id: "python-docker"

request
  method: "post"
  url: "http://localhost:8080/v2/apps"
  json: data
, (e, r, b) ->
  console.log "err", e if e
{code}
and it's obviously *NOT working*. You can see that the final states of the tasks differ - KILLED vs FINISHED. I would expect FINISHED in every case. http://prntscr.com/9pm28x
See the attached screenshot of the stdout output of both tasks: http://prntscr.com/9pm1ny
Here are also the complete logs of the master, slave, and marathon runs: https://gist.github.com/bydga/f8e907d4c59bbcab726e
I'm getting really desperate now - can someone confirm that they are able to reproduce the above?
[jira] [Commented] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091797#comment-15091797 ]

Martin Bydzovsky commented on MESOS-4279:
------------------------------------------

Well, now I did {{tail -f /var/log/upstart/docker.log}} and this is the output during the app restart:
{code}
INFO[0365] GET /v1.21/containers/bydga/marathon-test-api:latest/json
ERRO[0365] Handler for GET /v1.21/containers/bydga/marathon-test-api:latest/json returned error: no such id: bydga/marathon-test-api:latest
ERRO[0365] HTTP Error  err=no such id: bydga/marathon-test-api:latest statusCode=404
INFO[0365] GET /v1.21/images/bydga/marathon-test-api:latest/json
INFO[0366] POST /v1.21/containers/create?name=mesos-35e27fef-76b9-43f5-921d-83574ded0405-S0.80d8dd5e-d2e1-4379-aa27-1672f6fb8c6e
WARN[0366] Your kernel does not support swap limit capabilities, memory limited without swap.
INFO[0366] GET /v1.21/containers/mesos-35e27fef-76b9-43f5-921d-83574ded0405-S0.80d8dd5e-d2e1-4379-aa27-1672f6fb8c6e/json
INFO[0366] POST /v1.21/containers/70966cf9826b8e6cb14f60e4b82940786f226be717e2ab4136289117b571a178/attach?stderr=1&stdout=1&stream=1
INFO[0366] POST /v1.21/containers/70966cf9826b8e6cb14f60e4b82940786f226be717e2ab4136289117b571a178/start
INFO[0366] GET /v1.21/containers/mesos-35e27fef-76b9-43f5-921d-83574ded0405-S0.80d8dd5e-d2e1-4379-aa27-1672f6fb8c6e/json
INFO[0366] POST /v1.21/containers/mesos-35e27fef-76b9-43f5-921d-83574ded0405-S0.f83fa298-8ccb-4f2d-a215-1dbd0ed70789/stop?t=10
ERRO[0366] attach: stdout: write unix @: broken pipe
INFO[0369] GET /v1.21/containers/mesos-35e27fef-76b9-43f5-921d-83574ded0405-S0.f83fa298-8ccb-4f2d-a215-1dbd0ed70789/json
{code}
Maybe it's some docker-related problem..


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
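The mechanism behind that broken-pipe line can be reproduced without docker at all. In this self-contained Python sketch (not Mesos code), closing the read end of a child's stdout during its shutdown grace period makes the child's final prints fail exactly like the attach stream above:

{code:python}
# Self-contained sketch of the premature-discard problem: a child that
# logs *after* SIGTERM loses its output if the parent closes the pipe
# before the child finishes its grace period.
import signal
import subprocess
import time

CHILD = (
    "import signal, sys, time\n"
    "def bye(*a):\n"
    "    time.sleep(1)\n"
    "    print('ending')  # BrokenPipeError if the parent closed the pipe\n"
    "    sys.exit(0)\n"
    "signal.signal(signal.SIGTERM, bye)\n"
    "print('Hello', flush=True)\n"
    "time.sleep(60)\n"
)

p = subprocess.Popen(["python3", "-c", CHILD], stdout=subprocess.PIPE)
time.sleep(0.5)
p.send_signal(signal.SIGTERM)
p.stdout.close()  # the premature 'discard'; remove this line and 'ending' survives
p.wait()
{code}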
[jira] [Commented] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091776#comment-15091776 ]

Martin Bydzovsky commented on MESOS-4279:
------------------------------------------

Thanks [~qianzhang] for your time. Well, this is the moment when I start to feel bad - exactly this config is not working for me.. :/ I've just started a simple local mesos-marathon cluster with the following config:

mesos-slave http://10.141.141.10:5051/state.json
{code:javascript}
{
  attributes: {},
  build_date: "2015-10-12 20:57:28",
  build_time: 1444683448,
  build_user: "root",
  completed_frameworks: [],
  flags: {
    appc_store_dir: "/tmp/mesos/store/appc",
    authenticatee: "crammd5",
    cgroups_cpu_enable_pids_and_tids_count: "false",
    cgroups_enable_cfs: "false",
    cgroups_hierarchy: "/sys/fs/cgroup",
    cgroups_limit_swap: "false",
    cgroups_root: "mesos",
    container_disk_watch_interval: "15secs",
    containerizers: "docker,mesos",
    default_role: "*",
    disk_watch_interval: "1mins",
    docker: "docker",
    docker_kill_orphans: "true",
    docker_remove_delay: "6hrs",
    docker_socket: "/var/run/docker.sock",
    docker_stop_timeout: "10secs",
    enforce_container_disk_quota: "false",
    executor_registration_timeout: "5mins",
    executor_shutdown_grace_period: "5secs",
    fetcher_cache_dir: "/tmp/mesos/fetch",
    fetcher_cache_size: "2GB",
    frameworks_home: "",
    gc_delay: "1weeks",
    gc_disk_headroom: "0.1",
    hadoop_home: "",
    help: "false",
    hostname: "10.141.141.10",
    hostname_lookup: "true",
    image_provisioner_backend: "copy",
    initialize_driver_logging: "true",
    isolation: "posix/cpu,posix/mem",
    launcher_dir: "/usr/libexec/mesos",
    log_dir: "/var/log/mesos",
    logbufsecs: "0",
    logging_level: "INFO",
    master: "zk://localhost:2181/mesos",
    oversubscribed_resources_interval: "15secs",
    perf_duration: "10secs",
    perf_interval: "1mins",
    port: "5051",
    qos_correction_interval_min: "0ns",
    quiet: "false",
    recover: "reconnect",
    recovery_timeout: "15mins",
    registration_backoff_factor: "1secs",
    resource_monitoring_interval: "1secs",
    revocable_cpu_low_priority: "true",
    sandbox_directory: "/mnt/mesos/sandbox",
    strict: "true",
    switch_user: "true",
    systemd_runtime_directory: "/run/systemd/system",
    version: "false",
    work_dir: "/tmp/mesos"
  },
  git_sha: "2dd7f7ee115fe00b8e098b0a10762a4fa8f4600f",
  git_tag: "0.25.0",
  hostname: "10.141.141.10",
  id: "35e27fef-76b9-43f5-921d-83574ded0405-S0",
  log_dir: "/var/log/mesos",
  master_hostname: "mesos.vm",
  pid: "slave(1)@127.0.1.1:5051",
  resources: {
    cpus: 2,
    disk: 34068,
    mem: 1000,
    ports: "[31000-32000]"
  },
  start_time: 1452510028.844,
  version: "0.25.0"
}
{code}

mesos-master http://10.141.141.10:5050/state.json
{code:javascript}
{
  activated_slaves: 1,
  build_date: "2015-10-12 20:57:28",
  build_time: 1444683448,
  build_user: "root",
  completed_frameworks: [],
  deactivated_slaves: 0,
  elected_time: 1452509876.02982,
  flags: {
    allocation_interval: "1secs",
    allocator: "HierarchicalDRF",
    authenticate: "false",
    authenticate_slaves: "false",
    authenticators: "crammd5",
    authorizers: "local",
    framework_sorter: "drf",
    help: "false",
    hostname_lookup: "true",
    initialize_driver_logging: "true",
    log_auto_initialize: "true",
    log_dir: "/var/log/mesos",
    logbufsecs: "0",
    logging_level: "INFO",
    max_slave_ping_timeouts: "5",
    port: "5050",
    quiet: "false",
    quorum: "1",
    recovery_slave_removal_limit: "100%",
    registry: "replicated_log",
    registry_fetch_timeout: "1mins",
    registry_store_timeout: "5secs",
    registry_strict: "false",
    root_submissions: "true",
[jira] [Commented] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091884#comment-15091884 ]

Martin Bydzovsky commented on MESOS-4279:
------------------------------------------

Hmm, you are using the master branch (0.27 WIP) - we need to compile it to check whether it works :) I've been trying it on 0.24.1, 0.25 and 0.26 - which are in the distribution repositories. And another question - you use {{"containerizers": "mesos,docker"}} - does that mean mesos containerization has higher priority and thus gets used? What's the advantage of this approach over preferring the docker containerizer?


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089364#comment-15089364 ] Martin Bydzovsky commented on MESOS-4279: - Well, I guess you introduced another "issue" in your test example. It's related to the way you started the Marathon app. Please look at the explanation here: https://mesosphere.github.io/marathon/docs/native-docker.html#command-vs-args. In your {{ps}} output, you can see that the actual command is {{/bin/sh -c python /app/script.py}} - wrapped by sh -c. It seems like you started your Marathon app with something like:
{code}
curl -XPOST http://marathon:8080/v2/apps --data={id: "test-app", cmd: "python script.py", ...}
{code}
What I was showing in my examples above was:
{code}
curl -XPOST http://marathon:8080/v2/apps --data={id: "test-app", args: ["/tmp/script.py"], ...}
{code}
Usually this is called the "PID 1 problem" - https://medium.com/@gchudnov/trapping-signals-in-docker-containers-7a57fdda7d86#.zcxhq8yqn. Simply said, in your example PID 1 inside the docker container is the shell process and the actual python script is PID 2. The default signal disposition for all processes EXCEPT PID 1 is to shut down on SIGINT/SIGTERM; for PID 1, the default is to ignore them. So you could retry the example and use args instead of cmd. Then your {{ps}} output should look like:
{code}
root 10738 0.0 0.0 218228 14236 ? 15:22 0:00 docker run -c 102 -m 268435456 -e PORT_10002=31123 -e MARATHON_APP_VERSION=2016-01-08T15:22:49.646Z -e HOST=mesos-slave1.example.com -e MARATHON_APP_DOCKER_IMAGE=bydga/marathon-test-api -e MESOS_TASK_ID=marathon-test-api.ad9cbac5-b61b-11e5-af54-023bd987a59b -e PORT=31123 -e PORTS=31123 -e MARATHON_APP_ID=/marathon-test-api -e PORT0=31123 -e MESOS_SANDBOX=/mnt/mesos/sandbox -v /srv/mesos/slaves/20160106-114735-3423223818-5050-1508-S3/frameworks/20160106-083626-1258962954-5050-9311-/executors/marathon-test-api.ad9cbac5-b61b-11e5-af54-023bd987a59b/runs/bbeb80ab-e8d0-4b93-b7a0-6475787e090f:/mnt/mesos/sandbox --net host --name mesos-20160106-114735-3423223818-5050-1508-S3.bbeb80ab-e8d0-4b93-b7a0-6475787e090f bydga/marathon-test-api ./script.py
root 10749 0.0 0.0 21576 4336 ? 15:22 0:00 /usr/bin/python ./script.py
{code}
With this setup, docker stop works as expected:
{code}
bydzovskym mesos-slave1:aws ~ docker ps
CONTAINER ID   IMAGE                     COMMAND         CREATED         STATUS         PORTS   NAMES
ed4a35e4372c   bydga/marathon-test-api   "./script.py"   7 minutes ago   Up 7 minutes           mesos-20160106-114735-3423223818-5050-1508-S3.bbeb80ab-e8d0-4b93-b7a0-6475787e090f
bydzovskym mesos-slave1:aws ~ time docker stop ed4a35e4372c
ed4a35e4372c
real    0m2.184s
user    0m0.016s
sys     0m0.042s
{code}
and the output of the docker logs:
{code}
bydzovskym mesos-slave1:aws ~ docker logs -f ed4a35e4372c
Hello
15:15:57.943294
Iteration #1
15:15:58.944470
Iteration #2
15:15:59.945631
Iteration #3
15:16:00.946794
got 15
15:16:40.473517
15:16:42.475655
ending
Goodbye
{code}
The docker stop took a little more than 2 seconds - matching the grace period in the python script. I still guess the problem is somewhere in how Mesos orchestrates Docker - either it sends a wrong {{docker kill}}, or it kills the task even more painfully (e.g. killing the {{docker run}} process with the Linux {{kill}} command...
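To make the PID 1 point concrete, here is a minimal, hypothetical check (my own sketch, not from the original thread) that you can bake into the test image - it just reports which PID the script got, so you can see the cmd-vs-args difference directly:
{code}
#!/usr/bin/python
# Hypothetical sanity check for the PID 1 problem described above.
# Started via Marathon "args" (exec form), this runs as PID 1 and the
# handler fires on docker stop; started via "cmd" (shell form), PID 1
# is /bin/sh, which ignores SIGTERM by default, so the handler below
# is never reached before the hard kill.
import os
import signal
import sys
import time

def handler(signo, _frame):
    print "pid %i got signal %i" % (os.getpid(), signo)
    sys.stdout.flush()
    sys.exit(0)

signal.signal(signal.SIGTERM, handler)
print "running as pid %i" % os.getpid()  # 1 => signals reach us directly
sys.stdout.flush()
while True:
    time.sleep(1)
{code}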
> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
> Issue Type: Bug
> Components: containerization, docker
> Affects Versions: 0.25.0
> Reporter: Martin Bydzovsky
> Assignee: Qian Zhang
[jira] [Updated] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Bydzovsky updated MESOS-4279:
Description:
I'm implementing graceful restarts of our mesos-marathon-docker setup and I came across the following issue: (it was already discussed on https://github.com/mesosphere/marathon/issues/2876 and the guys from mesosphere got to a point that it's probably a docker containerizer problem...)
To sum it up: When I deploy a simple python script to all mesos-slaves:
{code}
#!/usr/bin/python
from time import sleep
import signal
import sys
import datetime

def sigterm_handler(_signo, _stack_frame):
    print "got %i" % _signo
    print datetime.datetime.now().time()
    sys.stdout.flush()
    sleep(2)
    print datetime.datetime.now().time()
    print "ending"
    sys.stdout.flush()
    sys.exit(0)

signal.signal(signal.SIGTERM, sigterm_handler)
signal.signal(signal.SIGINT, sigterm_handler)

try:
    print "Hello"
    i = 0
    while True:
        i += 1
        print datetime.datetime.now().time()
        print "Iteration #%i" % i
        sys.stdout.flush()
        sleep(1)
finally:
    print "Goodbye"
{code}
and I run it through Marathon like
{code:javascript}
data = {
  args: ["/tmp/script.py"],
  instances: 1,
  cpus: 0.1,
  mem: 256,
  id: "marathon-test-api"
}
{code}
During the app restart I get expected result - the task receives sigterm and dies peacefully (during my script-specified 2 seconds period)
But when I wrap this python script in a docker:
{code}
FROM node:4.2
RUN mkdir /app
ADD . /app
WORKDIR /app
ENTRYPOINT []
{code}
and run the appropriate application via Marathon:
{code:javascript}
data = {
  args: ["./script.py"],
  container: {
    type: "DOCKER",
    docker: {
      image: "bydga/marathon-test-api"
    },
    forcePullImage: yes
  },
  cpus: 0.1,
  mem: 256,
  instances: 1,
  id: "marathon-test-api"
}
{code}
The task during restart (issued from marathon) dies immediately without having a chance to do any cleanup.

was:
I'm implementing graceful restarts of our mesos-marathon-docker setup and I came across the following issue: (it was already discussed on https://github.com/mesosphere/marathon/issues/2876 and the guys from mesosphere got to a point that it's probably docker containerizer problem...)
To sum it up: When I deploy a simple python script to all mesos-slaves:
{code}
#!/usr/bin/python
from time import sleep
import signal
import sys
import datetime

def sigterm_handler(_signo, _stack_frame):
    print "got %i" % _signo
    print datetime.datetime.now().time()
    sys.stdout.flush()
    sleep(2)
    print datetime.datetime.now().time()
    print "ending"
    sys.stdout.flush()
    sys.exit(0)

signal.signal(signal.SIGTERM, sigterm_handler)
signal.signal(signal.SIGINT, sigterm_handler)

try:
    print "Hello"
    i = 0
    while True:
        i += 1
        print datetime.datetime.now().time()
        print "Iteration #%i" % i
        sys.stdout.flush()
        sleep(1)
finally:
    print "Goodbye"
{code}
and I run it through Marathon like
{code:javascript}
data = {
  args: ["/tmp/script.py"],
  instances: 1,
  cpus: 0.1,
  mem: 256,
  id: "marathon-test-api"
}
{code}
During app restart I get expected result - task receives sigterm and dies peacefully (during my script-specified 2 seconds)
But when I wrap this python script in docker:
{code}
FROM node:4.2
RUN mkdir /app
ADD . /app
WORKDIR /app
ENTRYPOINT []
{code}
and run the appropriate application via Marathon:
{code:javascript}
data = {
  args: ["./script.py"],
  container: {
    type: "DOCKER",
    docker: {
      image: "bydga/marathon-test-api"
    },
    forcePullImage: yes
  },
  cpus: 0.1,
  mem: 256,
  instances: 1,
  id: "marathon-test-api"
}
{code}
The task during restart (issued from marathon) dies immediately without a chance to do any cleanup.

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
> Issue Type: Bug
> Components: containerization, docker
> Affects Versions: 0.25.0
> Reporter: Martin Bydzovsky
[jira] [Created] (MESOS-4279) Graceful restart of docker task
Martin Bydzovsky created MESOS-4279: --- Summary: Graceful restart of docker task Key: MESOS-4279 URL: https://issues.apache.org/jira/browse/MESOS-4279 Project: Mesos Issue Type: Bug Components: containerization, docker Affects Versions: 0.25.0 Reporter: Martin Bydzovsky
I'm implementing graceful restarts of our mesos-marathon-docker setup and I came across the following issue: (it was already discussed on https://github.com/mesosphere/marathon/issues/2876 and the guys from mesosphere got to a point that it's probably docker containerizer problem...)
To sum it up: When I deploy a simple python script to all mesos-slaves:
{code}
#!/usr/bin/python
from time import sleep
import signal
import sys
import datetime

def sigterm_handler(_signo, _stack_frame):
    print "got %i" % _signo
    print datetime.datetime.now().time()
    sys.stdout.flush()
    sleep(2)
    print datetime.datetime.now().time()
    print "ending"
    sys.stdout.flush()
    sys.exit(0)

signal.signal(signal.SIGTERM, sigterm_handler)
signal.signal(signal.SIGINT, sigterm_handler)

try:
    print "Hello"
    i = 0
    while True:
        i += 1
        print datetime.datetime.now().time()
        print "Iteration #%i" % i
        sys.stdout.flush()
        sleep(1)
finally:
    print "Goodbye"
{code}
and I run it through Marathon like
{code:javascript}
data = {
  args: ["/tmp/script.py"],
  instances: 1,
  cpus: 0.1,
  mem: 256,
  id: "marathon-test-api"
}
{code}
During app restart I get expected result - task receives sigterm and dies peacefully (during my script-specified 2 seconds)
But when I wrap this python script in docker:
{code}
FROM node:4.2
RUN mkdir /app
ADD . /app
WORKDIR /app
ENTRYPOINT []
{code}
and run the appropriate application via Marathon:
{code:javascript}
data = {
  args: ["./script.py"],
  container: {
    type: "DOCKER",
    docker: {
      image: "bydga/marathon-test-api"
    },
    forcePullImage: yes
  },
  cpus: 0.1,
  mem: 256,
  instances: 1,
  id: "marathon-test-api"
}
{code}
The task during restart (issued from marathon) dies immediately without a chance to do any cleanup. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
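One way to see the difference the description complains about is to time the stop from the outside - a small sketch of mine (the container name and stop timeout are made up, not from the ticket): with a working SIGTERM handler the stop should return in roughly the script's 2-second grace period, while a task that never receives the signal only dies after Docker's own kill timeout.
{code}
#!/usr/bin/python
# Illustrative sketch: time how long "docker stop" takes for the test
# container. The container name below is hypothetical.
import subprocess
import time

start = time.time()
subprocess.call(["docker", "stop", "--time", "10", "mesos-test-container"])
print "docker stop returned after %.1fs" % (time.time() - start)
{code}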