Hi Paul,

No problem; I haven't even spent much time on it yet, and I'm glad you resolved the problem yourself.
We always welcome people doing this :)

Tim

On Fri, Jan 15, 2016 at 10:48 AM, Paul Bell <[email protected]> wrote:

> Tim,
>
> I've tracked down the cause of this problem: it's the result of some kind of incompatibility between kernel 3.19 and "VMware Tools". I know little more than that.
>
> I installed VMware Tools via *apt-get install open-vm-tools-lts-trusty*. Everything worked fine on 3.13, but when I upgrade to 3.19 the error occurs quite reliably. Revert back to 3.13 and the error goes away.
>
> I looked high & low for some statement of kernel requirements for VMware Tools, but could find none.
>
> Sorry to have wasted your time.
>
> -Paul
>
> On Fri, Jan 15, 2016 at 9:19 AM, Paul Bell <[email protected]> wrote:
>
>> In chasing down this problem, I stumbled upon something of moment: the problem does NOT seem to happen with kernel 3.13.
>>
>> Some weeks back, in the hope of getting past another problem wherein the root filesystem "becomes" R/O, I upgraded from 3.13 to 3.19 (Ubuntu 14.04 LTS). The kernel upgrade was done as shown here (there's some extra stuff to get rid of the Ubuntu desktop and liberate some disk space):
>>
>> apt-get update
>> apt-get -y remove ubuntu-desktop
>> apt-get -y purge lightdm
>> rm -Rf /var/lib/lightdm-data
>> apt-get -y remove --purge libreoffice-core
>> apt-get -y remove --purge libreoffice-common
>>
>> echo " Installing new kernel"
>>
>> apt-get -y install linux-generic-lts-vivid
>> apt-get -y autoremove linux-image-3.13.0-32-generic
>> apt-get -y autoremove linux-image-3.13.0-71-generic
>> update-grub
>> reboot
>>
>> After the reboot, "uname -r" shows kernel 3.19.0-42-generic.
>>
>> Under this kernel I can now reliably reproduce the failure to stop a MongoDB container. Specifically, any & all attempts to kill the container, e.g., via
>>
>> - a Marathon HTTP DELETE (which leads to mesos-docker-executor issuing the "docker stop" command)
>> - getting a shell inside the running container and issuing "kill" or db.shutdownServer()
>>
>> cause the mongod container
>>
>> - to show in its log that it's shutting down normally
>> - to enter a 100% CPU loop
>> - to become unkillable (only a reboot "fixes" things)
>>
>> Note finally that my conclusion about kernel 3.13 "working" is at present a weak induction. But I do know that when I reverted to that kernel I could, at least once, stop the containers w/o any problems, whereas at 3.19 I can reliably reproduce the problem. I will try to make this induction stronger as the day wears on.
>>
>> Did I do something "wrong" in my kernel upgrade steps?
>>
>> Is anyone aware of such an issue in 3.19, or of work done post-3.13 in the area of task termination & signal handling?
>>
>> Thanks for your help.
>>
>> -Paul
>>
>> On Thu, Jan 14, 2016 at 5:14 PM, Paul Bell <[email protected]> wrote:
>>
>>> I spoke too soon, I'm afraid.
>>>
>>> The next time I did the stop (with zero timeout), I saw the same phenomenon: a mongo container showing repeated:
>>>
>>> killing docker task
>>> shutting down
>>>
>>> What else can I try?
>>>
>>> Thank you.
>>>
>>> On Thu, Jan 14, 2016 at 5:07 PM, Paul Bell <[email protected]> wrote:
>>>
>>>> Hi Tim,
>>>>
>>>> I set docker_stop_timeout to zero as you asked. I am pleased to report (though a bit fearful about being pleased) that this change seems to have shut everyone down pretty much instantly.
>>>>
>>>> Can you explain what's happening, e.g., does docker_stop_timeout=0 cause the immediate use of "kill -9" as opposed to "kill -2"?
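(A note on the question above: the slave's --docker_stop_timeout value is passed straight through as the -t argument of "docker stop", which is visible further down this thread as "/usr/bin/docker stop -t 15 mesos-...". By default "docker stop" sends SIGTERM and then, once the timeout expires, SIGKILL; SIGINT ("kill -2") is not part of the sequence. A minimal sketch of the equivalent manual commands, where <container> is a placeholder for a name or ID taken from "docker ps":

    # 15-second grace period, as configured in this thread:
    # SIGTERM first, wait up to 15 seconds, then SIGKILL.
    docker stop -t 15 <container>

    # docker_stop_timeout=0secs: the SIGTERM is followed almost
    # immediately by SIGKILL (i.e., kill -9).
    docker stop -t 0 <container>

So a timeout of 0 does not change which signal is sent first; it only removes the graceful-shutdown window before the SIGKILL.)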
>>>> I will keep testing the behavior.
>>>>
>>>> Thank you.
>>>>
>>>> -Paul
>>>>
>>>> On Thu, Jan 14, 2016 at 3:59 PM, Paul Bell <[email protected]> wrote:
>>>>
>>>>> Hi Tim,
>>>>>
>>>>> Things have gotten slightly odder (if that's possible). When I now start the application's 5 or so containers, only one, "ecxconfigdb", gets started - and even it took a few tries. That is, I see it failing, moving to deploying, then starting again. But I've no evidence (no STDOUT, and no docker container logs) that shows why.
>>>>>
>>>>> In any event, ecxconfigdb does start. Happily, when I try to stop the application I am seeing the phenomena I posted before: "killing docker task" / "shutting down" repeated many times. The UN-stopped container is now running at 100% CPU.
>>>>>
>>>>> I will try modifying docker_stop_timeout. Back shortly.
>>>>>
>>>>> Thanks again.
>>>>>
>>>>> -Paul
>>>>>
>>>>> PS: what do you make of the "broken pipe" error in the docker.log?
>>>>>
>>>>> *from /var/log/upstart/docker.log*
>>>>>
>>>>> INFO[3054] GET /v1.15/images/mongo:2.6.8/json
>>>>> INFO[3054] GET /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>>>>> ERRO[3054] Handler for GET /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json returned error: No such image: mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>>>>> ERRO[3054] HTTP Error err=No such image: mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b statusCode=404
>>>>> INFO[3054] GET /v1.15/containers/weave/json
>>>>> INFO[3054] POST /v1.21/containers/create?name=mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>>>>> INFO[3054] POST /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/attach?stderr=1&stdout=1&stream=1
>>>>> INFO[3054] POST /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/start
>>>>> INFO[3054] GET /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>>>>> INFO[3054] GET /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>>>>> INFO[3054] GET /v1.15/containers/weave/json
>>>>> INFO[3054] GET /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>>>>> INFO[3054] GET /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>>>>> INFO[3054] GET /v1.15/containers/weave/json
>>>>> INFO[3054] GET /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>>>>> INFO[3054] GET /v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>>>>> INFO[3111] GET /v1.21/containers/json
>>>>> INFO[3120] GET /v1.21/containers/cf7/json
>>>>> INFO[3120] GET /v1.21/containers/cf7/logs?stderr=1&stdout=1&tail=all
>>>>> INFO[3153] GET /containers/json
>>>>> INFO[3153] GET /containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>>>>> INFO[3153] GET /containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json
>>>>> INFO[3153] GET /containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
>>>>> INFO[3175] GET /containers/json
>>>>> INFO[3175] GET /containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>>>>> INFO[3175] GET /containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json
>>>>> INFO[3175] GET /containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
>>>>> *INFO[3175] POST /v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/stop?t=15*
>>>>> *ERRO[3175] attach: stdout: write unix @: broken pipe*
>>>>> *INFO[3190] Container cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47 failed to exit within 15 seconds of SIGTERM - using the force*
>>>>> *INFO[3200] Container cf7fc7c48324 failed to exit within 10 seconds of kill - trying direct SIGKILL*
>>>>>
>>>>> *STDOUT from Mesos:*
>>>>>
>>>>> --container="mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b" --docker="/usr/local/ecxmcc/weaveShim" --help="false" --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" --quiet="false" --sandbox_directory="/tmp/mesos/slaves/20160114-153418-1674208327-5050-3798-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285/runs/c5c35d59-1318-4a96-b850-b0b788815f1b" --stop_timeout="15secs"
>>>>> --container="mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b" --docker="/usr/local/ecxmcc/weaveShim" --help="false" --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" --quiet="false" --sandbox_directory="/tmp/mesos/slaves/20160114-153418-1674208327-5050-3798-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285/runs/c5c35d59-1318-4a96-b850-b0b788815f1b" --stop_timeout="15secs"
>>>>> Registered docker executor on 71.100.202.99
>>>>> Starting task ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285
>>>>> 2016-01-14T20:45:38.613+0000 [initandlisten] MongoDB starting : pid=1 port=27017 dbpath=/data/db 64-bit host=ecxconfigdb
>>>>> 2016-01-14T20:45:38.614+0000 [initandlisten] db version v2.6.8
>>>>> 2016-01-14T20:45:38.614+0000 [initandlisten] git version: 3abc04d6d4f71de00b57378e3277def8fd7a6700
>>>>> 2016-01-14T20:45:38.614+0000 [initandlisten] build info: Linux build5.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
>>>>> 2016-01-14T20:45:38.614+0000 [initandlisten] allocator: tcmalloc
>>>>> 2016-01-14T20:45:38.614+0000 [initandlisten] options: { storage: { journal: { enabled: true } } }
>>>>> 2016-01-14T20:45:38.616+0000 [initandlisten] journal dir=/data/db/journal
>>>>> 2016-01-14T20:45:38.616+0000 [initandlisten] recover : no journal files present, no recovery needed
>>>>> 2016-01-14T20:45:39.006+0000 [initandlisten] waiting for connections on port 27017
>>>>> 2016-01-14T20:46:38.975+0000 [clientcursormon] mem (MB) res:77 virt:12942
>>>>> 2016-01-14T20:46:38.975+0000 [clientcursormon] mapped (incl journal view):12762
>>>>> 2016-01-14T20:46:38.975+0000 [clientcursormon] connections:0
>>>>> Killing docker task
>>>>> Shutting down
>>>>> Killing docker task
>>>>> Shutting down
>>>>> Killing docker task
>>>>> Shutting down
>>>>>
>>>>> On Thu, Jan 14, 2016 at 3:38 PM, Paul Bell <[email protected]> wrote:
>>>>>
>>>>>> Hey Tim,
>>>>>>
>>>>>> Thank you very much for your reply.
>>>>>>
>>>>>> Yes, I am in the midst of trying to reproduce the problem. If successful (so to speak), I will do as you ask.
>>>>>>
>>>>>> Cordially,
>>>>>>
>>>>>> Paul
>>>>>>
>>>>>> On Thu, Jan 14, 2016 at 3:19 PM, Tim Chen <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Paul,
>>>>>>>
>>>>>>> Looks like we've already issued the docker stop, as you can see in the ps output, but the containers are still running. Can you look at the Docker daemon logs and see what's going on there?
>>>>>>>
>>>>>>> And can you also try modifying docker_stop_timeout to 0, so that we SIGKILL the containers right away, and see if this still happens?
>>>>>>>
>>>>>>> Tim
>>>>>>>
>>>>>>> On Thu, Jan 14, 2016 at 11:52 AM, Paul Bell <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> It's been quite some time since I've posted here, and that's chiefly because up until a day or two ago things were working really well.
>>>>>>>>
>>>>>>>> I actually may have posted about this some time back, but then the problem seemed more intermittent.
>>>>>>>>
>>>>>>>> In summa, several "docker stops" don't work, i.e., the containers are not stopped.
>>>>>>>>
>>>>>>>> Deployment:
>>>>>>>>
>>>>>>>> one Ubuntu VM (VMware) 14.04 LTS with kernel 3.19
>>>>>>>> Zookeeper
>>>>>>>> Mesos-master (0.23.0)
>>>>>>>> Mesos-slave (0.23.0)
>>>>>>>> Marathon (0.10.0)
>>>>>>>> Docker 1.9.1
>>>>>>>> Weave 1.1.0
>>>>>>>> our application containers, which include:
>>>>>>>>   MongoDB (4)
>>>>>>>>   PostGres
>>>>>>>>   ECX (our product)
>>>>>>>>
>>>>>>>> The only thing that's changed at all in the config above is the version of Docker. It used to be 1.6.2, but I upgraded it today hoping to solve the problem.
>>>>>>>>
>>>>>>>> My automater program stops the application by sending Marathon an "http delete" for each running app. Every now & then (reliably reproducible today) not all containers get stopped. Most recently, 3 containers failed to stop.
>>>>>>>>
>>>>>>>> Here are the attendant phenomena:
>>>>>>>>
>>>>>>>> Marathon shows the 3 applications in deployment mode (presumably "deployment" in the sense of "stopping").
>>>>>>>>
>>>>>>>> *ps output:*
>>>>>>>>
>>>>>>>> root@71:~# ps -ef | grep docker
>>>>>>>> root      3823     1  0 13:55 ?   00:00:02 /usr/bin/docker daemon -H unix:///var/run/docker.sock -H tcp://0.0.0.0:4243
>>>>>>>> root      4967     1  0 13:57 ?   00:00:01 /usr/sbin/mesos-slave --master=zk://71.100.202.99:2181/mesos --log_dir=/var/log/mesos --containerizers=docker,mesos --docker=/usr/local/ecxmcc/weaveShim --docker_stop_timeout=15secs --executor_registration_timeout=5mins --hostname=71.100.202.99 --ip=71.100.202.99 --attributes=hostType:ecx,shard1 --resources=ports:[31000-31999,8443-8443]
>>>>>>>> root      5263  3823  0 13:57 ?   00:00:00 docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port 6783
>>>>>>>> root      5271  3823  0 13:57 ?   00:00:00 docker-proxy -proto udp -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port 6783
>>>>>>>> root      5279  3823  0 13:57 ?   00:00:00 docker-proxy -proto tcp -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port 53
>>>>>>>> root      5287  3823  0 13:57 ?   00:00:00 docker-proxy -proto udp -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port 53
>>>>>>>> root      7119  4967  0 14:00 ?   00:00:01 mesos-docker-executor --container=mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2 --docker=/usr/local/ecxmcc/weaveShim --help=false --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9/runs/bfc5a419-30f8-43f7-af2f-5582394532f2 --stop_timeout=15secs
>>>>>>>> root      7378  4967  0 14:00 ?   00:00:01 mesos-docker-executor --container=mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89 --docker=/usr/local/ecxmcc/weaveShim --help=false --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9/runs/9b700cdc-3d29-49b7-a7fc-e543a91f7b89 --stop_timeout=15secs
>>>>>>>> root      7640  4967  0 14:01 ?   00:00:01 mesos-docker-executor --container=mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298 --docker=/usr/local/ecxmcc/weaveShim --help=false --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9/runs/d7d861d3-cfc9-424d-b341-0631edea4298 --stop_timeout=15secs
>>>>>>>> *root      9696  9695  0 14:06 ?   00:00:00 /usr/bin/docker stop -t 15 mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298*
>>>>>>>> *root      9709  9708  0 14:06 ?   00:00:00 /usr/bin/docker stop -t 15 mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89*
>>>>>>>> *root      9720  9719  0 14:06 ?   00:00:00 /usr/bin/docker stop -t 15 mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2*
>>>>>>>>
>>>>>>>> *docker ps output:*
>>>>>>>>
>>>>>>>> root@71:~# docker ps
>>>>>>>> CONTAINER ID   IMAGE                         COMMAND                  CREATED          STATUS          PORTS                                                                                          NAMES
>>>>>>>> 5abafbfe7de2   mongo:2.6.8                   "/w/w /entrypoint.sh "   11 minutes ago   Up 11 minutes   27017/tcp                                                                                      mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298
>>>>>>>> a8449682ca2e   mongo:2.6.8                   "/w/w /entrypoint.sh "   11 minutes ago   Up 11 minutes   27017/tcp                                                                                      mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89
>>>>>>>> 3b956457374b   mongo:2.6.8                   "/w/w /entrypoint.sh "   11 minutes ago   Up 11 minutes   27017/tcp                                                                                      mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2
>>>>>>>> 4c1588bb3d4b   weaveworks/weaveexec:v1.1.0   "/home/weave/weavepro"   15 minutes ago   Up 15 minutes                                                                                                  weaveproxy
>>>>>>>> a26a0363584b   weaveworks/weave:v1.1.0       "/home/weave/weaver -"   15 minutes ago   Up 15 minutes   172.17.0.1:53->53/tcp, 172.17.0.1:53->53/udp, 0.0.0.0:6783->6783/tcp, 0.0.0.0:6783->6783/udp   weave
>>>>>>>>
>>>>>>>> *from /var/log/syslog:*
>>>>>>>>
>>>>>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.356405  5002 master.cpp:2944] Asked to kill task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000
>>>>>>>> *Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.356459  5002 master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 at slave(1)@71.100.202.99:5051 (71.100.202.99) to kill task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000 (marathon) at [email protected]:46167*
>>>>>>>> *Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.356729  5042 slave.cpp:1755] Asked to kill task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000*
>>>>>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.378295  5004 http.cpp:283] HTTP GET for /master/state.json from 172.19.15.61:65038 with User-Agent='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36'
>>>>>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.425904  5001 master.cpp:2944] Asked to kill task ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000
>>>>>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.425935  5001 master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 at slave(1)@71.100.202.99:5051 (71.100.202.99) to kill task ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000 (marathon) at [email protected]:46167
>>>>>>>> Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.426136  5041 slave.cpp:1755] Asked to kill task ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000
>>>>>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.435932  4998 master.cpp:2944] Asked to kill task ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000
>>>>>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.435958  4998 master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 at slave(1)@71.100.202.99:5051 (71.100.202.99) to kill task ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000 (marathon) at [email protected]:46167
>>>>>>>> Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.436151  5038 slave.cpp:1755] Asked to kill task ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000
>>>>>>>> Jan 14 14:10:03 71 mesos-master[4917]: I0114 14:10:03.759009  5001 master.cpp:4290] Sending 1 offers to framework 20160114-103414-1674208327-5050-3293-0000 (marathon) at [email protected]:46167
>>>>>>>> Jan 14 14:10:03 71 marathon[4937]: [2016-01-14 14:10:03,765] INFO started processing 1 offers, launching at most 1 tasks per offer and 1000 tasks in total (mesosphere.marathon.tasks.IterativeOfferMatcher$:132)
>>>>>>>> Jan 14 14:10:03 71 marathon[4937]: [2016-01-14 14:10:03,766] INFO Offer [20160114-135722-1674208327-5050-4917-O128]. Decline with default filter refuseSeconds (use --decline_offer_duration to configure) (mesosphere.marathon.tasks.IterativeOfferMatcher$:231)
>>>>>>>>
>>>>>>>> *from Mesos STDOUT of unstopped container:*
>>>>>>>>
>>>>>>>> Starting task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9
>>>>>>>> 2016-01-14T19:01:10.997+0000 [initandlisten] MongoDB starting : pid=1 port=27019 dbpath=/data/db/config master=1 64-bit host=mongoconfig
>>>>>>>> 2016-01-14T19:01:10.998+0000 [initandlisten] db version v2.6.8
>>>>>>>> 2016-01-14T19:01:10.998+0000 [initandlisten] git version: 3abc04d6d4f71de00b57378e3277def8fd7a6700
>>>>>>>> 2016-01-14T19:01:10.998+0000 [initandlisten] build info: Linux build5.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
>>>>>>>> 2016-01-14T19:01:10.998+0000 [initandlisten] allocator: tcmalloc
>>>>>>>> 2016-01-14T19:01:10.998+0000 [initandlisten] options: { sharding: { clusterRole: "configsvr" }, storage: { dbPath: "/data/db/config", journal: { enabled: true } } }
>>>>>>>> 2016-01-14T19:01:10.999+0000 [initandlisten] journal dir=/data/db/config/journal
>>>>>>>> 2016-01-14T19:01:11.000+0000 [initandlisten] recover : no journal files present, no recovery needed
>>>>>>>> 2016-01-14T19:01:11.429+0000 [initandlisten] warning: ClientCursor::staticYield can't unlock b/c of recursive lock ns:  top: { opid: 11, active: true, secs_running: 0, microsecs_running: 36, op: "query", ns: "local.oplog.$main", query: { query: {}, orderby: { $natural: -1 } }, client: "0.0.0.0:0", desc: "initandlisten", threadId: "0x7f8f73075b40", locks: { ^: "W" }, waitingForLock: false, numYields: 0, lockStats: { timeLockedMicros: {}, timeAcquiringMicros: {} } }
>>>>>>>> 2016-01-14T19:01:11.429+0000 [initandlisten] waiting for connections on port 27019
>>>>>>>> 2016-01-14T19:01:17.405+0000 [initandlisten] connection accepted from 10.2.0.3:51189 #1 (1 connection now open)
>>>>>>>> 2016-01-14T19:01:17.413+0000 [initandlisten] connection accepted from 10.2.0.3:51190 #2 (2 connections now open)
>>>>>>>> 2016-01-14T19:01:17.413+0000 [initandlisten] connection accepted from 10.2.0.3:51191 #3 (3 connections now open)
>>>>>>>> 2016-01-14T19:01:17.414+0000 [conn3] first cluster operation detected, adding sharding hook to enable versioning and authentication to remote servers
>>>>>>>> 2016-01-14T19:01:17.414+0000 [conn3] CMD fsync: sync:1 lock:0
>>>>>>>> 2016-01-14T19:01:17.415+0000 [conn3] CMD fsync: sync:1 lock:0
>>>>>>>> 2016-01-14T19:01:17.415+0000 [conn3] CMD fsync: sync:1 lock:0
>>>>>>>> 2016-01-14T19:01:17.415+0000 [conn3] CMD fsync: sync:1 lock:0
>>>>>>>> 2016-01-14T19:01:17.416+0000 [conn3] CMD fsync: sync:1 lock:0
>>>>>>>> 2016-01-14T19:01:17.416+0000 [conn3] CMD fsync: sync:1 lock:0
>>>>>>>> 2016-01-14T19:01:17.416+0000 [conn3] CMD fsync: sync:1 lock:0
>>>>>>>> 2016-01-14T19:01:17.419+0000 [initandlisten] connection accepted from 10.2.0.3:51193 #4 (4 connections now open)
>>>>>>>> 2016-01-14T19:01:17.420+0000 [initandlisten] connection accepted from 10.2.0.3:51194 #5 (5 connections now open)
>>>>>>>> 2016-01-14T19:01:17.442+0000 [conn1] end connection 10.2.0.3:51189 (4 connections now open)
>>>>>>>> 2016-01-14T19:02:11.285+0000 [clientcursormon] mem (MB) res:59 virt:385
>>>>>>>> 2016-01-14T19:02:11.285+0000 [clientcursormon] mapped (incl journal view):192
>>>>>>>> 2016-01-14T19:02:11.285+0000 [clientcursormon] connections:4
>>>>>>>> 2016-01-14T19:03:11.293+0000 [clientcursormon] mem (MB) res:72 virt:385
>>>>>>>> 2016-01-14T19:03:11.294+0000 [clientcursormon] mapped (incl journal view):192
>>>>>>>> 2016-01-14T19:03:11.294+0000 [clientcursormon] connections:4
>>>>>>>> Killing docker task
>>>>>>>> Shutting down
>>>>>>>> Killing docker task
>>>>>>>> Shutting down
>>>>>>>> Killing docker task
>>>>>>>> Shutting down
>>>>>>>> Killing docker task
>>>>>>>> Shutting down
>>>>>>>> Killing docker task
>>>>>>>> Shutting down
>>>>>>>> Killing docker task
>>>>>>>> Shutting down
>>>>>>>> Killing docker task
>>>>>>>> Shutting down
>>>>>>>> Killing docker task
>>>>>>>> Shutting down
>>>>>>>> Killing docker task
>>>>>>>> Shutting down
>>>>>>>> Killing docker task
>>>>>>>> Shutting down
>>>>>>>> Killing docker task
>>>>>>>> Shutting down
>>>>>>>> Killing docker task
>>>>>>>> Shutting down
>>>>>>>> Killing docker task
>>>>>>>> Shutting down
>>>>>>>> Killing docker task
>>>>>>>> Shutting down
>>>>>>>> Killing docker task
>>>>>>>> Shutting down
>>>>>>>> Killing docker task
>>>>>>>>
>>>>>>>> Most disturbing in all of this is that while I can stop the deployments in Marathon (which does properly end the "docker stop" commands visible in the ps output), I cannot bounce Docker itself, neither via Upstart nor via a kill command. Ultimately, I have to reboot the VM.
>>>>>>>>
>>>>>>>> FWIW, the 3 mongod containers (apparently stuck in their "Killing docker task / Shutting down" loop) are running at 100% CPU, as evinced by both "docker stats" and "top".
>>>>>>>>
>>>>>>>> I would truly be grateful for some guidance on this - even a mere work-around would be appreciated.
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>> -Paul
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
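A follow-up note on diagnosing the unkillable, 100%-CPU containers described above: when even SIGKILL has no effect, the usual explanation is that the process is stuck inside the kernel rather than in user space, which would fit the finding at the top of this thread that the behavior appears on kernel 3.19 but not 3.13. A minimal diagnostic sketch, assuming root access on the VM; <pid> is a placeholder for one of the hung mongod or docker PIDs:

    # Show scheduler state and kernel wait channel of the hung processes.
    # STAT "D" (uninterruptible sleep) means the process cannot act on any
    # signal, including kill -9, until the kernel operation completes.
    ps -eo pid,stat,wchan:32,comm | grep -E 'mongod|docker'

    # For a process spinning at 100% CPU, its kernel stack (root only) may
    # show whether the loop is inside the kernel:
    cat /proc/<pid>/stack

Capturing that output before rebooting would make it easier to tie the regression to a specific kernel or open-vm-tools code path.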

