Hey Tim,

Thank you very much for your reply.

Yes, I am in the midst of trying to reproduce the problem. If successful (so
to speak), I will do as you ask.

Cordially,

Paul

On Thu, Jan 14, 2016 at 3:19 PM, Tim Chen <t...@mesosphere.io> wrote:

> Hi Paul,
>
> It looks like we've already issued the "docker stop", as you can see in the
> ps output, but the containers are still running. Can you look at the Docker
> daemon logs and see what's going on there?
>
> Can you also try setting docker_stop_timeout to 0, so that we SIGKILL the
> containers right away, and see if this still happens?
>
> Tim
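
For reference, a minimal sketch of both suggestions, assuming Docker is
managed by Upstart on this Ubuntu 14.04 host and that mesos-slave picks up
its flags from files under /etc/mesos-slave/ (the Mesosphere packaging
convention); adjust paths if this install differs:

# Watch the Docker daemon log while reproducing the hung "docker stop":
tail -f /var/log/upstart/docker.log

# Have the agent SIGKILL containers immediately instead of waiting 15s,
# then restart it so the new value takes effect:
echo '0secs' > /etc/mesos-slave/docker_stop_timeout
service mesos-slave restart

# Equivalent flag if mesos-slave is launched by hand:
#   /usr/sbin/mesos-slave ... --docker_stop_timeout=0secs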

> On Thu, Jan 14, 2016 at 11:52 AM, Paul Bell <arach...@gmail.com> wrote:
>
>> Hi All,
>>
>> It's been quite some time since I've posted here, and that's chiefly
>> because up until a day or two ago, things were working really well.
>>
>> I actually may have posted about this some time back, but then the
>> problem seemed more intermittent.
>>
>> In summa, several "docker stops" don't work, i.e., the containers are not
>> stopped.
>>
>> Deployment:
>>
>> one Ubuntu VM (VMware) LTS 14.04 with kernel 3.19
>> Zookeeper
>> Mesos-master (0.23.0)
>> Mesos-slave (0.23.0)
>> Marathon (0.10.0)
>> Docker 1.9.1
>> Weave 1.1.0
>> Our application containers, which include:
>>   MongoDB (4)
>>   Postgres
>>   ECX (our product)
>>
>> The only thing that's changed at all in the config above is the version
>> of Docker: it used to be 1.6.2, but today I upgraded it hoping to solve
>> the problem.
>>
>> My automater program stops the application by sending Marathon an "http
>> delete" for each running app (see the sketch at the end of this message
>> for the shape of that call). Every now & then (reliably reproducible
>> today) not all containers get stopped. Most recently, 3 containers failed
>> to stop.
>>
>> Here are the attendant phenomena:
>>
>> Marathon shows the 3 applications in deployment mode (presumably
>> "deployment" in the sense of "stopping").
>>
>> *ps output:*
>>
>> root@71:~# ps -ef | grep docker
>> root 3823 1 0 13:55 ? 00:00:02 /usr/bin/docker daemon -H unix:///var/run/docker.sock -H tcp://0.0.0.0:4243
>> root 4967 1 0 13:57 ? 00:00:01 /usr/sbin/mesos-slave --master=zk://71.100.202.99:2181/mesos --log_dir=/var/log/mesos --containerizers=docker,mesos --docker=/usr/local/ecxmcc/weaveShim --docker_stop_timeout=15secs --executor_registration_timeout=5mins --hostname=71.100.202.99 --ip=71.100.202.99 --attributes=hostType:ecx,shard1 --resources=ports:[31000-31999,8443-8443]
>> root 5263 3823 0 13:57 ? 00:00:00 docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port 6783
>> root 5271 3823 0 13:57 ? 00:00:00 docker-proxy -proto udp -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port 6783
>> root 5279 3823 0 13:57 ? 00:00:00 docker-proxy -proto tcp -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port 53
>> root 5287 3823 0 13:57 ? 00:00:00 docker-proxy -proto udp -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port 53
>> root 7119 4967 0 14:00 ? 00:00:01 mesos-docker-executor --container=mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2 --docker=/usr/local/ecxmcc/weaveShim --help=false --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9/runs/bfc5a419-30f8-43f7-af2f-5582394532f2 --stop_timeout=15secs
>> root 7378 4967 0 14:00 ? 00:00:01 mesos-docker-executor --container=mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89 --docker=/usr/local/ecxmcc/weaveShim --help=false --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9/runs/9b700cdc-3d29-49b7-a7fc-e543a91f7b89 --stop_timeout=15secs
>> root 7640 4967 0 14:01 ? 00:00:01 mesos-docker-executor --container=mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298 --docker=/usr/local/ecxmcc/weaveShim --help=false --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9/runs/d7d861d3-cfc9-424d-b341-0631edea4298 --stop_timeout=15secs
>> *root 9696 9695 0 14:06 ? 00:00:00 /usr/bin/docker stop -t 15 mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298*
>> *root 9709 9708 0 14:06 ? 00:00:00 /usr/bin/docker stop -t 15 mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89*
>> *root 9720 9719 0 14:06 ? 00:00:00 /usr/bin/docker stop -t 15 mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2*
>>
>> *docker ps output:*
>>
>> root@71:~# docker ps
>> CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
>> 5abafbfe7de2  mongo:2.6.8  "/w/w /entrypoint.sh "  11 minutes ago  Up 11 minutes  27017/tcp  mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298
>> a8449682ca2e  mongo:2.6.8  "/w/w /entrypoint.sh "  11 minutes ago  Up 11 minutes  27017/tcp  mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89
>> 3b956457374b  mongo:2.6.8  "/w/w /entrypoint.sh "  11 minutes ago  Up 11 minutes  27017/tcp  mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2
>> 4c1588bb3d4b  weaveworks/weaveexec:v1.1.0  "/home/weave/weavepro"  15 minutes ago  Up 15 minutes  weaveproxy
>> a26a0363584b  weaveworks/weave:v1.1.0  "/home/weave/weaver -"  15 minutes ago  Up 15 minutes  172.17.0.1:53->53/tcp, 172.17.0.1:53->53/udp, 0.0.0.0:6783->6783/tcp, 0.0.0.0:6783->6783/udp  weave
>>
>> *from /var/log/syslog:*
>>
>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.356405 5002 master.cpp:2944] Asked to kill task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000
>> *Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.356459 5002 master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 at slave(1)@71.100.202.99:5051 (71.100.202.99) to kill task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000 (marathon) at scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167*
>> *Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.356729 5042 slave.cpp:1755] Asked to kill task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000*
>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.378295 5004 http.cpp:283] HTTP GET for /master/state.json from 172.19.15.61:65038 with User-Agent='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36'
>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.425904 5001 master.cpp:2944] Asked to kill task ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000
>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.425935 5001 master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 at slave(1)@71.100.202.99:5051 (71.100.202.99) to kill task ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000 (marathon) at scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167
>> Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.426136 5041 slave.cpp:1755] Asked to kill task ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000
>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.435932 4998 master.cpp:2944] Asked to kill task ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000
>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.435958 4998 master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 at slave(1)@71.100.202.99:5051 (71.100.202.99) to kill task ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000 (marathon) at scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167
>> Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.436151 5038 slave.cpp:1755] Asked to kill task ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000
>> Jan 14 14:10:03 71 mesos-master[4917]: I0114 14:10:03.759009 5001 master.cpp:4290] Sending 1 offers to framework 20160114-103414-1674208327-5050-3293-0000 (marathon) at scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167
>> Jan 14 14:10:03 71 marathon[4937]: [2016-01-14 14:10:03,765] INFO started processing 1 offers, launching at most 1 tasks per offer and 1000 tasks in total (mesosphere.marathon.tasks.IterativeOfferMatcher$:132)
>> Jan 14 14:10:03 71 marathon[4937]: [2016-01-14 14:10:03,766] INFO Offer [20160114-135722-1674208327-5050-4917-O128]. Decline with default filter refuseSeconds (use --decline_offer_duration to configure) (mesosphere.marathon.tasks.IterativeOfferMatcher$:231)
>>
>> *from Mesos STDOUT of unstopped container:*
>>
>> Starting task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9
>> 2016-01-14T19:01:10.997+0000 [initandlisten] MongoDB starting : pid=1 port=27019 dbpath=/data/db/config master=1 64-bit host=mongoconfig
>> 2016-01-14T19:01:10.998+0000 [initandlisten] db version v2.6.8
>> 2016-01-14T19:01:10.998+0000 [initandlisten] git version: 3abc04d6d4f71de00b57378e3277def8fd7a6700
>> 2016-01-14T19:01:10.998+0000 [initandlisten] build info: Linux build5.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
>> 2016-01-14T19:01:10.998+0000 [initandlisten] allocator: tcmalloc
>> 2016-01-14T19:01:10.998+0000 [initandlisten] options: { sharding: { clusterRole: "configsvr" }, storage: { dbPath: "/data/db/config", journal: { enabled: true } } }
>> 2016-01-14T19:01:10.999+0000 [initandlisten] journal dir=/data/db/config/journal
>> 2016-01-14T19:01:11.000+0000 [initandlisten] recover : no journal files present, no recovery needed
>> 2016-01-14T19:01:11.429+0000 [initandlisten] warning: ClientCursor::staticYield can't unlock b/c of recursive lock ns: top: { opid: 11, active: true, secs_running: 0, microsecs_running: 36, op: "query", ns: "local.oplog.$main", query: { query: {}, orderby: { $natural: -1 } }, client: "0.0.0.0:0", desc: "initandlisten", threadId: "0x7f8f73075b40", locks: { ^: "W" }, waitingForLock: false, numYields: 0, lockStats: { timeLockedMicros: {}, timeAcquiringMicros: {} } }
>> 2016-01-14T19:01:11.429+0000 [initandlisten] waiting for connections on port 27019
>> 2016-01-14T19:01:17.405+0000 [initandlisten] connection accepted from 10.2.0.3:51189 #1 (1 connection now open)
>> 2016-01-14T19:01:17.413+0000 [initandlisten] connection accepted from 10.2.0.3:51190 #2 (2 connections now open)
>> 2016-01-14T19:01:17.413+0000 [initandlisten] connection accepted from 10.2.0.3:51191 #3 (3 connections now open)
>> 2016-01-14T19:01:17.414+0000 [conn3] first cluster operation detected, adding sharding hook to enable versioning and authentication to remote servers
>> 2016-01-14T19:01:17.414+0000 [conn3] CMD fsync: sync:1 lock:0
>> 2016-01-14T19:01:17.415+0000 [conn3] CMD fsync: sync:1 lock:0
>> 2016-01-14T19:01:17.415+0000 [conn3] CMD fsync: sync:1 lock:0
>> 2016-01-14T19:01:17.415+0000 [conn3] CMD fsync: sync:1 lock:0
>> 2016-01-14T19:01:17.416+0000 [conn3] CMD fsync: sync:1 lock:0
>> 2016-01-14T19:01:17.416+0000 [conn3] CMD fsync: sync:1 lock:0
>> 2016-01-14T19:01:17.416+0000 [conn3] CMD fsync: sync:1 lock:0
>> 2016-01-14T19:01:17.419+0000 [initandlisten] connection accepted from 10.2.0.3:51193 #4 (4 connections now open)
>> 2016-01-14T19:01:17.420+0000 [initandlisten] connection accepted from 10.2.0.3:51194 #5 (5 connections now open)
>> 2016-01-14T19:01:17.442+0000 [conn1] end connection 10.2.0.3:51189 (4 connections now open)
>> 2016-01-14T19:02:11.285+0000 [clientcursormon] mem (MB) res:59 virt:385
>> 2016-01-14T19:02:11.285+0000 [clientcursormon] mapped (incl journal view):192
>> 2016-01-14T19:02:11.285+0000 [clientcursormon] connections:4
>> 2016-01-14T19:03:11.293+0000 [clientcursormon] mem (MB) res:72 virt:385
>> 2016-01-14T19:03:11.294+0000 [clientcursormon] mapped (incl journal view):192
>> 2016-01-14T19:03:11.294+0000 [clientcursormon] connections:4
>> Killing docker task
>> Shutting down
>> Killing docker task
>> Shutting down
>> Killing docker task
>> Shutting down
>> Killing docker task
>> Shutting down
>> Killing docker task
>> Shutting down
>> Killing docker task
>> Shutting down
>> Killing docker task
>> Shutting down
>> Killing docker task
>> Shutting down
>> Killing docker task
>> Shutting down
>> Killing docker task
>> Shutting down
>> Killing docker task
>> Shutting down
>> Killing docker task
>> Shutting down
>> Killing docker task
>> Shutting down
>> Killing docker task
>> Shutting down
>> Killing docker task
>> Shutting down
>> Killing docker task
>>
>> Most disturbing in all of this is that while I can stop the deployments
>> in Marathon (which properly ends the "docker stop" commands visible in
>> the ps output), I cannot bounce Docker itself, neither via Upstart nor
>> with the kill command. Ultimately, I have to reboot the VM.
>>
>> FWIW, the 3 mongod containers (apparently stuck in their "Killing docker
>> task / Shutting down" loop) are running at 100% CPU, as evinced by both
>> "docker stats" and "top".
>>
>> I would truly be grateful for some guidance on this; even a mere
>> work-around would be appreciated.
>>
>> Thank you.
>>
>> -Paul
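
The "http delete" mentioned above is, roughly, one Marathon REST call per
app. A minimal sketch only, assuming Marathon's HTTP API is on its default
port 8080 on this host and using an app id inferred from the task names
shown in the logs; the actual automater may differ:

# Ask Marathon to destroy one app; Marathon then tells Mesos to kill its
# tasks, which is what produces the "docker stop -t 15" processes above.
curl -X DELETE http://71.100.202.99:8080/v2/apps/ecxconfigdb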