Re: Share GPU resources via attributes or as custom resources (INTERNAL)

2016-01-14 Thread haosdent
>Then, if a job is sent to the machine when the 4 GPUs are already busy,
the job will fail to start, right?
I am not sure about this. But if the job fails, Marathon would retry it, as you said.

>a job is sent to the machine, all 4 GPUs will become busy
If you specify that your task only uses 1 gpu in its resources field, I think
Mesos could continue to provide offers that still have gpu available. And I
remember that Marathon constraints
only work with --attributes.
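
A minimal sketch of the setup being discussed, assuming the flag values quoted
in this thread; the Marathon app id, command, and constraint value below are
illustrative placeholders, not taken from the original messages:

```
# Advertise the GPUs both as a custom resource (so Mesos accounts for them) and
# as an attribute (so Marathon constraints can match on it). Placeholder hosts.
mesos-slave --master=zk://<zk-host>:2181/mesos \
  --resources="gpu(*):4" \
  --attributes="hasGpu:true"

# Marathon constraints match attributes, not resources, so a GPU job could be
# pinned to such agents like this (hypothetical app definition):
cat > gpu-job.json <<'EOF'
{
  "id": "/gpu-job",
  "cmd": "./run-gpu-job.sh",
  "cpus": 1,
  "mem": 1024,
  "constraints": [["hasGpu", "LIKE", "true"]]
}
EOF
curl -X POST -H "Content-Type: application/json" \
     -d @gpu-job.json http://<marathon-host>:8080/v2/apps
```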

On Fri, Jan 15, 2016 at 1:02 AM,  wrote:

> I have a machine with 4 GPUs and want to use Mesos+Marathon to schedule
> the jobs to be run in the machine. Each job will use maximum 1 GPU and
> sharing 1 GPU between small jobs would be ok.
> I know Mesos does not directly support GPUs, but it seems I might use
> custom resources or attributes to do what I want. But how exactly should
> this be done?
>
> If I use --attributes="hasGpu:true", would a job be sent to the machine
> when another job is already running in the machine (and only using 1 GPU)?
> I would say all jobs requesting a machine with a hasGpu attribute would be
> sent to the machine (as long as it has free CPU and memory resources).
> Then, if a job is sent to the machine when the 4 GPUs are already busy, the
> job will fail to start, right? Could then Marathon be used to re-send the
> job after some time, until it is accepted by the machine?
>
> If I specify --resources="gpu(*):4", it is my understanding that once a
> job is sent to the machine, all 4 GPUs will become busy to the eyes of
> Mesos (even if this is not really true). If that is right, would this
> work-around work: specify 4 different resources: gpu:A, gpu:B, gpu:C and
> gpu:D; and use constraints in Marathon like this  "constraints": [["gpu",
> "LIKE", " [A-D]"]]?
>
> Cheers
>



-- 
Best Regards,
Haosdent Huang


Re: Powered by mesos list

2016-01-14 Thread o...@magnetic.io
tnx!

Would like to hear opinions on how to categorise our solution and maybe 
restructure/rephrase this page:

Vamp is not so much “built on Mesos” as it “makes use of Mesos”: it offers 
higher-level features using our Mesos/Marathon driver. Maybe it’s semantics, but 
I just wanted to check with the community.

Also, we’re a canary-testing and releasing framework, which doesn’t really seem 
to fit the current categories in the “built on Mesos” article. The most fitting 
category would be “Batch Scheduling”, but that wouldn’t entirely fit the 
use case of Vamp. My suggestion would be “Continuous deployment, testing and 
scaling”.

Any thoughts/suggestions?

tnx, Olaf


> On 05 Jan 2016, at 02:54, Benjamin Mahler  wrote:
> 
> There are two sections: 'Organizations Using Mesos' and 'Software projects 
> built on Mesos'. The latter links to the list of frameworks.
> 
> If you fit either of these descriptions, then we can get you added, just 
> forward to us a pull request or reviewboard request.
> 
> On Tue, Dec 8, 2015 at 10:34 PM, Olaf Magnetic wrote:
> Hi Benjamin,
> 
> What are the criteria to be included on the powered by mesos list? Would love 
> to have our canary-test and release framework VAMP (www.vamp.io) which runs
> on mesos/marathon on this list too. 
> 
> Cheers, Olaf 
> 
> 
> On 08 Dec 2015, at 22:36, Benjamin Mahler wrote:
> 
>> Thanks for sharing Arunabha! I'm a big fan of the multi-framework compute 
>> platform approach, please share your feedback along the way :)
>> 
>> Would you like to be added to the powered by mesos list?
>> https://github.com/apache/mesos/blob/master/docs/powered-by-mesos.md 
>> 
>> 
>> On Mon, Dec 7, 2015 at 1:30 PM, Arunabha Ghosh wrote:
>> Hi Folks,
>>   We, at Moz have been working for a while on RogerOS, our next 
>> gen application platform built on top of Mesos. We've reached a point in the 
>> project where we feel it's ready to share with the world :-)
>> 
>> The blog posts introducing RogerOS can be found at
>>  
>> https://moz.com/devblog/introducing-rogeros-part-1/ 
>> 
>> https://moz.com/devblog/introducing-rogeros-part-2/ 
>> 
>> 
>> I can safely say that without Mesos, it would not have been possible for us 
>> to have built the system within the constraints of time and resources that 
>> we had. As we note in the blog 
>> 
>> " We are very glad that we chose Mesos though. It has delivered on all of 
>> its promises and more. We’ve had no issues with stability, extensibility, 
>> and performance of the system and it has allowed us to achieve our goals 
>> with a fraction of the development resources that would have been required 
>> otherwise. "
>> 
>> We would also like to thank the wonderful Mesos community for all the help 
>> and support we've received. Along the way we've tried to contribute back to 
>> the community through talks at Mesoscon and now through open sourcing our 
>> efforts.
>> 
>> Your feedback and thoughts are always welcome !
>> 
>> Thanks,
>> Arunabha
>> 
>>   
>> 
> 



Tasks failing when restarting slave on Mesos 0.23.1

2016-01-14 Thread Matthias Bach
Hi all,

We are using Mesos 0.23.1 in combination with Aurora 0.10.0. So far we
have been using the JSON format for Mesos' credential files. However,
because of MESOS-3695 we decided to switch to the plain text format
before updating to 0.24.1. Our understanding is that this should be a
NOOP. However, on our cluster this caused multiple tasks to fail on each
slave.

I have attached two excerpts from the Mesos slave log. In one I
grepped for the executor ID of one of the failed tasks, and in the other I
grepped for the ID of the corresponding container. What you can see is
that recovery of the container is started and, immediately afterwards,
the executor is killed.

Our change procedure was:
* Place the new plain-text credential file
* Restart the slave with `--credential` pointing to the new file
* Remove the old JSON credential file
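
For reference, a minimal sketch of the two credential file formats involved;
the path and the principal/secret values here are placeholders, not taken from
this report:

```
# Old JSON format (placeholder values):
#   { "principal": "some-principal", "secret": "some-secret" }
# New plain-text format: principal and secret on a single line, separated by
# whitespace:
echo "some-principal some-secret" > /etc/mesos/credential.txt
```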

We are running the Mesos slave using supervisord and use the following
isolators: cgroups/cpu, cgroups/mem, filesystem/shared, namespaces/pid,
and posix/disk. In addition we use `--enforce_container_disk_quota`.
Regarding recovery we use the options `--recover="reconnect"` and
`--strict="false"`.
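
Putting those flags together, the slave invocation presumably looks roughly
like the following sketch; the master address, work_dir, and credential path
are placeholders, not taken from this report:

```
# Sketch of the slave command line as described above (placeholder paths/hosts).
mesos-slave \
  --master=zk://<zk-host>:2181/mesos \
  --work_dir=/var/lib/mesos \
  --isolation=cgroups/cpu,cgroups/mem,filesystem/shared,namespaces/pid,posix/disk \
  --enforce_container_disk_quota \
  --recover=reconnect \
  --strict=false \
  --credential=/etc/mesos/credential.txt
```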

The Thermos log does not provide any hints as to what happened. It looks
like Thermos was SIGKILLed.

Has any of you run into this problem before? Do you have an idea what
could cause this behaviour? Do you have any suggestion what information
we could look for to better understand what happens?

Kind Regards,
Matthias

-- 
Dr. Matthias Bach
Senior Software Engineer
*Blue Yonder GmbH*
Ohiostraße 8
D-76149 Karlsruhe

Tel +49 (0)721 383 117 6244
Fax +49 (0)721 383 117 69

matthias.b...@blue-yonder.com
www.blue-yonder.com
Registergericht Mannheim, HRB 704547
USt-IdNr. DE DE 277 091 535
Geschäftsführer: Jochen Bossert, Uwe Weiss (CEO)

I0114 14:09:51.213526 23008 containerizer.cpp:371] Recovering container 'e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a' for executor 'thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3' of framework 20150930-134812-84017418-5050-29407-0001
I0114 14:09:51.230132 23056 mem.cpp:602] Started listening for OOM events for container e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a
I0114 14:09:51.230499 23056 mem.cpp:718] Started listening on low memory pressure events for container e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a
I0114 14:09:51.230828 23056 mem.cpp:718] Started listening on medium memory pressure events for container e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a
I0114 14:09:51.231233 23056 mem.cpp:718] Started listening on critical memory pressure events for container e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a
I0114 14:09:53.584983 23014 containerizer.cpp:1001] Destroying container 'e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a'
I0114 14:09:53.585428 23014 linux_launcher.cpp:358] Using pid namespace to destroy container e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a
I0114 14:09:53.800837 23014 containerizer.cpp:1188] Executor for container 'e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a' has exited
I0114 14:09:53.802088 23012 cgroups.cpp:2382] Freezing cgroup /sys/fs/cgroup/freezer/mesos/e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a
I0114 14:09:53.803673 22996 cgroups.cpp:1415] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a after 1.552896ms
I0114 14:09:53.804822 23008 cgroups.cpp:2399] Thawing cgroup /sys/fs/cgroup/freezer/mesos/e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a
I0114 14:09:53.806593 23012 cgroups.cpp:1444] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a after 1.753856ms
W0114 14:09:54.639930 23014 containerizer.cpp:885] Ignoring update for unknown container: e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a
I0114 14:09:54.708149 23002 gc.cpp:56] Scheduling '/var/lib/mesos/slaves/20151021-121051-84017418-5050-52142-S3/frameworks/20150930-134812-84017418-5050-29407-0001/executors/thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3/runs/e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a' for gc 6.9180405037days in the future
I0114 14:09:54.708226 22996 gc.cpp:56] Scheduling '/var/lib/mesos/meta/slaves/20151021-121051-84017418-5050-52142-S3/frameworks/20150930-134812-84017418-5050-29407-0001/executors/thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3/runs/e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a' for gc 6.9180321778days in the future
 

I0114 14:09:51.090075 22993 slave.cpp:4842] Recovering executor 'thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3' of framework 20150930-134812-84017418-5050-29407-0001
I0114 14:09:51.168311 22999 status_update_manager.cpp:210] Recovering executor 'thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3' of framework 20150930-134812-84017418-5050-29407-0001
I0114 14:09:51.213526 23008 

Share GPU resources via attributes or as custom resources (INTERNAL)

2016-01-14 Thread Humberto.Castejon
I have a machine with 4 GPUs and want to use Mesos+Marathon to schedule the 
jobs to be run on the machine. Each job will use at most 1 GPU, and sharing 1 
GPU between small jobs would be OK.
I know Mesos does not directly support GPUs, but it seems I might use custom 
resources or attributes to do what I want. But how exactly should this be done?

If I use --attributes="hasGpu:true", would a job be sent to the machine when 
another job is already running in the machine (and only using 1 GPU)? I would 
say all jobs requesting a machine with a hasGpu attribute would be sent to the 
machine (as long as it has free CPU and memory resources). Then, if a job is 
sent to the machine when the 4 GPUs are already busy, the job will fail to 
start, right? Could then Marathon be used to re-send the job after some time, 
until it is accepted by the machine?

If I specify --resources="gpu(*):4", it is my understanding that once a job is 
sent to the machine, all 4 GPUs will become busy in the eyes of Mesos (even if 
this is not really true). If that is right, would this work-around work: 
specify 4 different resources: gpu:A, gpu:B, gpu:C and gpu:D; and use 
constraints in Marathon like this  "constraints": [["gpu", "LIKE", " [A-D]"]]?

Cheers


Help needed (alas, urgently)

2016-01-14 Thread Paul Bell
Hi All,

It's been quite some time since I've posted here and that's chiefly because
up until a day or two ago, things were working really well.

I actually may have posted about this some time back. But then the problem
seemed more intermittent.

In summa, several "docker stops" don't work, i.e., the containers are not
stopped.

Deployment:

one Ubuntu VM (vmWare) LTS 14.04 with kernel 3.19
Zookeeper
Mesos-master (0.23.0)
Mesos-slave (0.23.0)
Marathon (0.10.0)
Docker 1.9.1
Weave 1.1.0
Our application containers, which include:
MongoDB (4)
PostGres
ECX (our product)

The only thing that's changed at all in the config above is the version of
Docker. Used to be 1.6.2 but I today upgraded it hoping to solve the
problem.


My automater program stops the application by sending Marathon an "http
delete" for each running up. Every now & then (reliably reproducible today)
not all containers get stopped. Most recently, 3 containers failed to stop.
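
Presumably the automater issues something like the following per app (a sketch;
the Marathon host, port, and app id are placeholders, not from this message):

```
# Stop/destroy one app via Marathon's REST API (placeholder host and app id).
curl -X DELETE http://<marathon-host>:8080/v2/apps/ecxconfigdb
```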

Here are the attendant phenomena:

Marathon shows the 3 applications in deployment mode (presumably
"deployment" in the sense of "stopping")

*ps output:*

root@71:~# ps -ef | grep docker
root  3823 1  0 13:55 ?00:00:02 /usr/bin/docker daemon -H
unix:///var/run/docker.sock -H tcp://0.0.0.0:4243
root  4967 1  0 13:57 ?00:00:01 /usr/sbin/mesos-slave
--master=zk://71.100.202.99:2181/mesos --log_dir=/var/log/mesos
--containerizers=docker,mesos --docker=/usr/local/ecxmcc/weaveShim
--docker_stop_timeout=15secs --executor_registration_timeout=5mins
--hostname=71.100.202.99 --ip=71.100.202.99
--attributes=hostType:ecx,shard1 --resources=ports:[31000-31999,8443-8443]
root  5263  3823  0 13:57 ?00:00:00 docker-proxy -proto tcp
-host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port
6783
root  5271  3823  0 13:57 ?00:00:00 docker-proxy -proto udp
-host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port
6783
root  5279  3823  0 13:57 ?00:00:00 docker-proxy -proto tcp
-host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port
53
root  5287  3823  0 13:57 ?00:00:00 docker-proxy -proto udp
-host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port
53
root  7119  4967  0 14:00 ?00:00:01 mesos-docker-executor
--container=mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2
--docker=/usr/local/ecxmcc/weaveShim --help=false
--mapped_directory=/mnt/mesos/sandbox
--sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9/runs/bfc5a419-30f8-43f7-af2f-5582394532f2
--stop_timeout=15secs
root  7378  4967  0 14:00 ?00:00:01 mesos-docker-executor
--container=mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89
--docker=/usr/local/ecxmcc/weaveShim --help=false
--mapped_directory=/mnt/mesos/sandbox
--sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9/runs/9b700cdc-3d29-49b7-a7fc-e543a91f7b89
--stop_timeout=15secs
root  7640  4967  0 14:01 ?00:00:01 mesos-docker-executor
--container=mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298
--docker=/usr/local/ecxmcc/weaveShim --help=false
--mapped_directory=/mnt/mesos/sandbox
--sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9/runs/d7d861d3-cfc9-424d-b341-0631edea4298
--stop_timeout=15secs
*root  9696  9695  0 14:06 ?    00:00:00 /usr/bin/docker stop -t 15
mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298*
*root  9709  9708  0 14:06 ?    00:00:00 /usr/bin/docker stop -t 15
mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89*
*root  9720  9719  0 14:06 ?    00:00:00 /usr/bin/docker stop -t 15
mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2*

*docker ps output:*

root@71:~# docker ps
CONTAINER IDIMAGE COMMAND
 CREATED STATUS  PORTS
 NAMES
5abafbfe7de2mongo:2.6.8   "/w/w /entrypoint.sh "
11 minutes ago  Up 11 minutes   27017/tcp

 
mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298
a8449682ca2emongo:2.6.8   "/w/w /entrypoint.sh "
11 minutes ago  Up 11 minutes   27017/tcp

 
mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89
3b956457374bmongo:2.6.8   

Re: slave nodes are living in two cluster and can not remove correctly.

2016-01-14 Thread X Brick
Sorry, I posted the wrong API response for cluster A; here is the correct one:

{
>   "active": true,
>   "attributes": {
> "apps": "logstash",
> "colo": "cn5",
> "type": "prod"
>   },
>   "hostname": "l-bu128g5-10k10.ops.cn2.qunar.com",
>   "id": "20151230-034049-3282655242-5050-1802-S7",
>   "pid": "slave(1)@10.90.5.19:5051",
>   "registered_time": 1452094227.39161,
>   "reregistered_time": 1452831994.32924,
>   "resources": {
> "cpus": 32,
> "disk": 2728919,
> "mem": 128126,
> "ports": "[8100-1, 31000-32000]"
>   }
> }
>

2016-01-15 12:22 GMT+08:00 X Brick :

> Hi folks,
>
> I meet a very strange issue when I migrated two nodes from one cluster to
> another about one week ago.
>
> Two nodes:
>
> l-bu128g3-10k10.ops.cn2
> l-bu128g5-10k10.ops.cn2
>
> I did not clean the mesos data dir before they join the another cluster,
> then I found the nodes live in two cluster at the same time.
>
> Cluster A (Mesos 0.22):
>
>
> Cluster B (Mesos 0.25):
>
>
> ​
> ​
> I thought maybe the old data make these happened, so I clear up these two
> nodes data dir and rejoin the cluster A. But nothing changed, they still
> come back to the old cluster(Cluster B).
>
>
> Here is the "/master/slaves" response:
>
> Cluster A:
>
> {
>>   "slaves": [
>> {
>>   "active": true,
>>   "attributes": {
>> "apps": "logstash",
>> "colo": "cn5",
>> "type": "prod"
>>   },
>>   "hostname": "l-bu128g9-10k10.ops.cn2.qunar.com",
>>   "id": "3e7ba6b1-29fd-44e8-9be2-f72896054ac6-S5",
>>   "pid": "slave(1)@10.90.5.23:5051",
>>   "registered_time": 1451990379.49813,
>>   "reregistered_time": 1452093251.39516,
>>   "resources": {
>> "cpus": 32,
>> "disk": 2728919,
>> "mem": 128126,
>> "ports": "[8100-1, 31000-32000]"
>>   }
>> },
>>
>>
> Cluster B:
>
> {
>>   "slaves": [
>> {
>>   "active": false,
>>   "attributes": {
>> "apps": "logstash",
>> "colo": "cn5",
>> "type": "prod"
>>   },
>>   "hostname": "l-bu128g5-10k10.ops.cn2.qunar.com",
>>   "id": "3e7ba6b1-29fd-44e8-9be2-f72896054ac6-S2",
>>   "offered_resources": {
>> "cpus": 0,
>> "disk": 0,
>> "mem": 0
>>   },
>>   "pid": "slave(1)@10.90.5.19:5051",
>>   "registered_time": 1451988622.66323,
>>   "reserved_resources": {},
>>   "resources": {
>> "cpus": 32.0,
>> "disk": 2728919.0,
>> "mem": 128126.0,
>> "ports": "[8100-1, 31000-32000]"
>>   },
>>   "unreserved_resources": {
>> "cpus": 32.0,
>> "disk": 2728919.0,
>> "mem": 128126.0,
>> "ports": "[8100-1, 31000-32000]"
>>   },
>>   "used_resources": {
>> "cpus": 0,
>> "disk": 0,
>> "mem": 0
>>   }
>> },
>> .
>>
>>
>
> I found some useful logs:
>
>
>> I0105 18:36:22.683724 6452 slave.cpp:2248] Updated checkpointed resources
>> from to
>> I0105 18:37:09.900497 6459 slave.cpp:3926] Current disk usage 0.06%. Max
>> allowed age: 1.798706758587755days
>> I0105 18:37:22.678374 6453 slave.cpp:3146] Master marked the slave as
>> disconnected but the slave considers itself registered! Forcing
>> re-registration.
>> I0105 18:37:22.678699 6453 slave.cpp:694] Re-detecting master
>> I0105 18:37:22.678715 6471 status_update_manager.cpp:176] Pausing sending
>> status updates
>> I0105 18:37:22.678753 6453 slave.cpp:741] Detecting new master
>> I0105 18:37:22.678977 6456 status_update_manager.cpp:176] Pausing sending
>> status updates
>> I0105 18:37:22.679047 6455 slave.cpp:705] New master detected at
>> master@10.88.169.195:5050
>> I0105 18:37:22.679108 6455 slave.cpp:768] Authenticating with master
>> master@10.88.169.195:5050
>> I0105 18:37:22.679136 6455 slave.cpp:773] Using default CRAM-MD5
>> authenticatee
>> I0105 18:37:22.679239 6455 slave.cpp:741] Detecting new master
>> I0105 18:37:22.679354 6464 authenticatee.cpp:115] Creating new client
>> SASL connection
>> I0105 18:37:22.680883 6461 authenticatee.cpp:206] Received SASL
>> authentication mechanisms: CRAM-MD5
>> I0105 18:37:22.680946 6461 authenticatee.cpp:232] Attempting to
>> authenticate with mechanism 'CRAM-MD5'
>> I0105 18:37:22.681759 6455 authenticatee.cpp:252] Received SASL
>> authentication step
>> I0105 18:37:22.682874 6454 authenticatee.cpp:292] Authentication success
>> I0105 18:37:22.682986 6441 slave.cpp:836] Successfully authenticated with
>> master master@10.88.169.195:5050
>> I0105 18:37:22.684303 6454 slave.cpp:980] Re-registered with master
>> master@10.88.169.195:5050
>> I0105 18:37:22.684455 6454 slave.cpp:1016] Forwarding total
>> oversubscribed resources
>> I0105 18:37:22.684471 6468 status_update_manager.cpp:183] Resuming
>> sending status updates
>> I0105 18:37:22.684649 6454 slave.cpp:2152] Updating framework
>> 20150610-204949-3299432458-5050-25057- pid to
>> scheduler-1bef8172-5068-44c6-93f5-e97a3910ed79@10.88.169.195:35708
>> 

Re: slave nodes are living in two cluster and can not remove correctly.

2016-01-14 Thread Shuai Lin
Based on your description, you have two clusters:

- old cluster B, with mesos 0.25, and the master ip is 10.88.169.195
- new cluster A, with mesos 0.22, and the master ip is 10.90.12.29

Also you have a slave S, 10.90.5.19, which was originally in cluster B, and
you have reconfigured it to join cluster A, but forgot to cleanup the slave
work dir.

From the logs, S is now registered with cluster A (which is what you
intended), but S is still shown in the slaves list of cluster B (which is
confusing), and the master of cluster B is still sending messages to S:

```
W0105 19:05:38.207882 6450 slave.cpp:1973] Ignoring shutdown framework
message for 3e7ba6b1-29fd-44e8-9be2-f72896054ac6-0116 from
master@10.90.12.29:5050 because it is not from the registered master (
master@10.88.169.195:5050)
```

What's in the master logs of cluster A and B?  That could help others
understand the problem.
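
As a side note, one common way to move an agent between clusters is sketched
below; the service commands and the work_dir path are assumptions (use whatever
--work_dir and service manager the slave actually runs with):

```
# Sketch only: stop the slave, drop its checkpointed identity so it registers
# with a fresh slave ID, then start it pointing at the new master only.
service mesos-slave stop
rm -rf /var/lib/mesos/meta/slaves/latest   # or wipe the whole work_dir
service mesos-slave start
```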



On Fri, Jan 15, 2016 at 12:27 PM, X Brick  wrote:

> sorry for the wrong api response of cluster A
>
> {
>>   "active": true,
>>   "attributes": {
>> "apps": "logstash",
>> "colo": "cn5",
>> "type": "prod"
>>   },
>>   "hostname": "l-bu128g5-10k10.ops.cn2.qunar.com",
>>   "id": "20151230-034049-3282655242-5050-1802-S7",
>>   "pid": "slave(1)@10.90.5.19:5051",
>>   "registered_time": 1452094227.39161,
>>   "reregistered_time": 1452831994.32924,
>>   "resources": {
>> "cpus": 32,
>> "disk": 2728919,
>> "mem": 128126,
>> "ports": "[8100-1, 31000-32000]"
>>   }
>> }
>>
>
> 2016-01-15 12:22 GMT+08:00 X Brick :
>
>> Hi folks,
>>
>> I meet a very strange issue when I migrated two nodes from one cluster to
>> another about one week ago.
>>
>> Two nodes:
>>
>> l-bu128g3-10k10.ops.cn2
>> l-bu128g5-10k10.ops.cn2
>>
>> I did not clean the mesos data dir before they join the another cluster,
>> then I found the nodes live in two cluster at the same time.
>>
>> Cluster A (Mesos 0.22):
>>
>>
>> Cluster B (Mesos 0.25):
>>
>>
>> ​
>> ​
>> I thought maybe the old data make these happened, so I clear up these two
>> nodes data dir and rejoin the cluster A. But nothing changed, they still
>> come back to the old cluster(Cluster B).
>>
>>
>> Here is the "/master/slaves" response:
>>
>> Cluster A:
>>
>> {
>>>   "slaves": [
>>> {
>>>   "active": true,
>>>   "attributes": {
>>> "apps": "logstash",
>>> "colo": "cn5",
>>> "type": "prod"
>>>   },
>>>   "hostname": "l-bu128g9-10k10.ops.cn2.qunar.com",
>>>   "id": "3e7ba6b1-29fd-44e8-9be2-f72896054ac6-S5",
>>>   "pid": "slave(1)@10.90.5.23:5051",
>>>   "registered_time": 1451990379.49813,
>>>   "reregistered_time": 1452093251.39516,
>>>   "resources": {
>>> "cpus": 32,
>>> "disk": 2728919,
>>> "mem": 128126,
>>> "ports": "[8100-1, 31000-32000]"
>>>   }
>>> },
>>>
>>>
>> Cluster B:
>>
>> {
>>>   "slaves": [
>>> {
>>>   "active": false,
>>>   "attributes": {
>>> "apps": "logstash",
>>> "colo": "cn5",
>>> "type": "prod"
>>>   },
>>>   "hostname": "l-bu128g5-10k10.ops.cn2.qunar.com",
>>>   "id": "3e7ba6b1-29fd-44e8-9be2-f72896054ac6-S2",
>>>   "offered_resources": {
>>> "cpus": 0,
>>> "disk": 0,
>>> "mem": 0
>>>   },
>>>   "pid": "slave(1)@10.90.5.19:5051",
>>>   "registered_time": 1451988622.66323,
>>>   "reserved_resources": {},
>>>   "resources": {
>>> "cpus": 32.0,
>>> "disk": 2728919.0,
>>> "mem": 128126.0,
>>> "ports": "[8100-1, 31000-32000]"
>>>   },
>>>   "unreserved_resources": {
>>> "cpus": 32.0,
>>> "disk": 2728919.0,
>>> "mem": 128126.0,
>>> "ports": "[8100-1, 31000-32000]"
>>>   },
>>>   "used_resources": {
>>> "cpus": 0,
>>> "disk": 0,
>>> "mem": 0
>>>   }
>>> },
>>> .
>>>
>>>
>>
>> I found some useful logs:
>>
>>
>>> I0105 18:36:22.683724 6452 slave.cpp:2248] Updated checkpointed
>>> resources from to
>>> I0105 18:37:09.900497 6459 slave.cpp:3926] Current disk usage 0.06%. Max
>>> allowed age: 1.798706758587755days
>>> I0105 18:37:22.678374 6453 slave.cpp:3146] Master marked the slave as
>>> disconnected but the slave considers itself registered! Forcing
>>> re-registration.
>>> I0105 18:37:22.678699 6453 slave.cpp:694] Re-detecting master
>>> I0105 18:37:22.678715 6471 status_update_manager.cpp:176] Pausing
>>> sending status updates
>>> I0105 18:37:22.678753 6453 slave.cpp:741] Detecting new master
>>> I0105 18:37:22.678977 6456 status_update_manager.cpp:176] Pausing
>>> sending status updates
>>> I0105 18:37:22.679047 6455 slave.cpp:705] New master detected at
>>> master@10.88.169.195:5050
>>> I0105 18:37:22.679108 6455 slave.cpp:768] Authenticating with master
>>> master@10.88.169.195:5050
>>> I0105 18:37:22.679136 6455 slave.cpp:773] Using default CRAM-MD5
>>> authenticatee
>>> I0105 

slave nodes are living in two cluster and can not remove correctly.

2016-01-14 Thread X Brick
Hi folks,

I ran into a very strange issue when I migrated two nodes from one cluster to
another about one week ago.

Two nodes:

l-bu128g3-10k10.ops.cn2
l-bu128g5-10k10.ops.cn2

I did not clean the Mesos data dir before they joined the other cluster, and
then I found the nodes living in two clusters at the same time.

Cluster A (Mesos 0.22):


Cluster B (Mesos 0.25):


​
​
I thought maybe the old data made this happen, so I cleaned up the data dir on
these two nodes and rejoined them to cluster A. But nothing changed; they still
come back to the old cluster (cluster B).


Here is the "/master/slaves" response:

Cluster A:

{
>   "slaves": [
> {
>   "active": true,
>   "attributes": {
> "apps": "logstash",
> "colo": "cn5",
> "type": "prod"
>   },
>   "hostname": "l-bu128g9-10k10.ops.cn2.qunar.com",
>   "id": "3e7ba6b1-29fd-44e8-9be2-f72896054ac6-S5",
>   "pid": "slave(1)@10.90.5.23:5051",
>   "registered_time": 1451990379.49813,
>   "reregistered_time": 1452093251.39516,
>   "resources": {
> "cpus": 32,
> "disk": 2728919,
> "mem": 128126,
> "ports": "[8100-1, 31000-32000]"
>   }
> },
>
>
Cluster B:

{
>   "slaves": [
> {
>   "active": false,
>   "attributes": {
> "apps": "logstash",
> "colo": "cn5",
> "type": "prod"
>   },
>   "hostname": "l-bu128g5-10k10.ops.cn2.qunar.com",
>   "id": "3e7ba6b1-29fd-44e8-9be2-f72896054ac6-S2",
>   "offered_resources": {
> "cpus": 0,
> "disk": 0,
> "mem": 0
>   },
>   "pid": "slave(1)@10.90.5.19:5051",
>   "registered_time": 1451988622.66323,
>   "reserved_resources": {},
>   "resources": {
> "cpus": 32.0,
> "disk": 2728919.0,
> "mem": 128126.0,
> "ports": "[8100-1, 31000-32000]"
>   },
>   "unreserved_resources": {
> "cpus": 32.0,
> "disk": 2728919.0,
> "mem": 128126.0,
> "ports": "[8100-1, 31000-32000]"
>   },
>   "used_resources": {
> "cpus": 0,
> "disk": 0,
> "mem": 0
>   }
> },
> .
>
>

I found some useful logs:


> I0105 18:36:22.683724 6452 slave.cpp:2248] Updated checkpointed resources
> from to
> I0105 18:37:09.900497 6459 slave.cpp:3926] Current disk usage 0.06%. Max
> allowed age: 1.798706758587755days
> I0105 18:37:22.678374 6453 slave.cpp:3146] Master marked the slave as
> disconnected but the slave considers itself registered! Forcing
> re-registration.
> I0105 18:37:22.678699 6453 slave.cpp:694] Re-detecting master
> I0105 18:37:22.678715 6471 status_update_manager.cpp:176] Pausing sending
> status updates
> I0105 18:37:22.678753 6453 slave.cpp:741] Detecting new master
> I0105 18:37:22.678977 6456 status_update_manager.cpp:176] Pausing sending
> status updates
> I0105 18:37:22.679047 6455 slave.cpp:705] New master detected at
> master@10.88.169.195:5050
> I0105 18:37:22.679108 6455 slave.cpp:768] Authenticating with master
> master@10.88.169.195:5050
> I0105 18:37:22.679136 6455 slave.cpp:773] Using default CRAM-MD5
> authenticatee
> I0105 18:37:22.679239 6455 slave.cpp:741] Detecting new master
> I0105 18:37:22.679354 6464 authenticatee.cpp:115] Creating new client SASL
> connection
> I0105 18:37:22.680883 6461 authenticatee.cpp:206] Received SASL
> authentication mechanisms: CRAM-MD5
> I0105 18:37:22.680946 6461 authenticatee.cpp:232] Attempting to
> authenticate with mechanism 'CRAM-MD5'
> I0105 18:37:22.681759 6455 authenticatee.cpp:252] Received SASL
> authentication step
> I0105 18:37:22.682874 6454 authenticatee.cpp:292] Authentication success
> I0105 18:37:22.682986 6441 slave.cpp:836] Successfully authenticated with
> master master@10.88.169.195:5050
> I0105 18:37:22.684303 6454 slave.cpp:980] Re-registered with master
> master@10.88.169.195:5050
> I0105 18:37:22.684455 6454 slave.cpp:1016] Forwarding total oversubscribed
> resources
> I0105 18:37:22.684471 6468 status_update_manager.cpp:183] Resuming sending
> status updates
> I0105 18:37:22.684649 6454 slave.cpp:2152] Updating framework
> 20150610-204949-3299432458-5050-25057- pid to
> scheduler-1bef8172-5068-44c6-93f5-e97a3910ed79@10.88.169.195:35708
> I0105 18:37:22.685025 6452 status_update_manager.cpp:183] Resuming sending
> status updates
> I0105 18:37:22.685117 6454 slave.cpp:2248] Updated checkpointed resources
> from to
> I0105 18:38:09.901587 6464 slave.cpp:3926] Current disk usage 0.06%. Max
> allowed age: 1.798706755730266days
> I0105 18:38:22.679468 6451 slave.cpp:3146] Master marked the slave as
> disconnected but the slave considers itself registered! Forcing
> re-registration.
> I0105 18:38:22.679739 6451 slave.cpp:694] Re-detecting master
> I0105 18:38:22.679754 6453 status_update_manager.cpp:176] Pausing sending
> status updates
> I0105 18:38:22.679785 6451 slave.cpp:741] Detecting new master
> I0105 18:38:22.680054 6461 slave.cpp:705] New master detected at
> 

Re: Help needed (alas, urgently)

2016-01-14 Thread Tim Chen
Hi Paul,

Looks like we've already issued the docker stop, as you've seen in the ps
output, but the containers are still running. Can you look at the Docker
daemon logs and see what's going on there?

And can you also try to modify docker_stop_timeout to 0 so that we
SIGKILL the containers right away, and see if this still happens?
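
Concretely, that would mean changing one flag in the slave invocation shown in
the ps output above; a sketch with only the timeout value changed:

```
# The slave command from the ps output, with --docker_stop_timeout dropped from
# 15secs to 0secs so the docker stop escalates to SIGKILL right away.
/usr/sbin/mesos-slave \
  --master=zk://71.100.202.99:2181/mesos --log_dir=/var/log/mesos \
  --containerizers=docker,mesos --docker=/usr/local/ecxmcc/weaveShim \
  --docker_stop_timeout=0secs --executor_registration_timeout=5mins \
  --hostname=71.100.202.99 --ip=71.100.202.99 \
  --attributes=hostType:ecx,shard1 --resources=ports:[31000-31999,8443-8443]
```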

Tim



On Thu, Jan 14, 2016 at 11:52 AM, Paul Bell <arach...@gmail.com> wrote:

> Hi All,
>
> It's been quite some time since I've posted here and that's chiefly
> because up until a day or two ago, things were working really well.
>
> I actually may have posted about this some time back. But then the problem
> seemed more intermittent.
>
> In summa, several "docker stops" don't work, i.e., the containers are not
> stopped.
>
> Deployment:
>
> one Ubuntu VM (vmWare) LTS 14.04 with kernel 3.19
> Zookeeper
> Mesos-master (0.23.0)
> Mesos-slave (0.23.0)
> Marathon (0.10.0)
> Docker 1.9.1
> Weave 1.1.0
> Our application contains which include
> MongoDB (4)
> PostGres
> ECX (our product)
>
> The only thing that's changed at all in the config above is the version of
> Docker. Used to be 1.6.2 but I today upgraded it hoping to solve the
> problem.
>
>
> My automater program stops the application by sending Marathon an "http
> delete" for each running up. Every now & then (reliably reproducible today)
> not all containers get stopped. Most recently, 3 containers failed to stop.
>
> Here are the attendant phenomena:
>
> Marathon shows the 3 applications in deployment mode (presumably
> "deployment" in the sense of "stopping")
>
> *ps output:*
>
> root@71:~# ps -ef | grep docker
> root  3823 1  0 13:55 ?00:00:02 /usr/bin/docker daemon -H
> unix:///var/run/docker.sock -H tcp://0.0.0.0:4243
> root  4967 1  0 13:57 ?00:00:01 /usr/sbin/mesos-slave
> --master=zk://71.100.202.99:2181/mesos --log_dir=/var/log/mesos
> --containerizers=docker,mesos --docker=/usr/local/ecxmcc/weaveShim
> --docker_stop_timeout=15secs --executor_registration_timeout=5mins
> --hostname=71.100.202.99 --ip=71.100.202.99
> --attributes=hostType:ecx,shard1 --resources=ports:[31000-31999,8443-8443]
> root  5263  3823  0 13:57 ?00:00:00 docker-proxy -proto tcp
> -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port
> 6783
> root  5271  3823  0 13:57 ?00:00:00 docker-proxy -proto udp
> -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port
> 6783
> root  5279  3823  0 13:57 ?00:00:00 docker-proxy -proto tcp
> -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port
> 53
> root  5287  3823  0 13:57 ?00:00:00 docker-proxy -proto udp
> -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port
> 53
> root  7119  4967  0 14:00 ?00:00:01 mesos-docker-executor
> --container=mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2
> --docker=/usr/local/ecxmcc/weaveShim --help=false
> --mapped_directory=/mnt/mesos/sandbox
> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9/runs/bfc5a419-30f8-43f7-af2f-5582394532f2
> --stop_timeout=15secs
> root  7378  4967  0 14:00 ?00:00:01 mesos-docker-executor
> --container=mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89
> --docker=/usr/local/ecxmcc/weaveShim --help=false
> --mapped_directory=/mnt/mesos/sandbox
> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9/runs/9b700cdc-3d29-49b7-a7fc-e543a91f7b89
> --stop_timeout=15secs
> root  7640  4967  0 14:01 ?    00:00:01 mesos-docker-executor
> --container=mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298
> --docker=/usr/local/ecxmcc/weaveShim --help=false
> --mapped_directory=/mnt/mesos/sandbox
> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9/runs/d7d861d3-cfc9-424d-b341-0631edea4298
> --stop_timeout=15secs
> *root  9696  9695  0 14:06 ?00:00:00 /usr/bin/docker stop -t
> 15
> mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298*
> *root  9709  9708  0 14:06 ?00:00:00 /usr/bin/docker stop -t
> 15
> mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b

Re: Help needed (alas, urgently)

2016-01-14 Thread Paul Bell
Hi Tim,

Things have gotten slightly odder (if that's possible). When I now start
the application (5 or so containers), only one, "ecxconfigdb", gets started -
and even it took a few tries. That is, I see it failing, moving to
deploying, then starting again. But I have no evidence (no STDOUT, and no
docker container logs) that shows why.

In any event, ecxconfigdb does start. Happily, when I try to stop the
application I am seeing the phenomena I posted before: "killing docker task" and
"shutting down" repeated many times. The un-stopped container is now running
at 100% CPU.

I will try modifying docker_stop_timeout. Back shortly

Thanks again.

-Paul

PS: what do you make of the "broken pipe" error in the docker.log?

*from /var/log/upstart/docker.log*

INFO[3054] GET /v1.15/images/mongo:2.6.8/json
INFO[3054] GET
/v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
ERRO[3054] Handler for GET
/v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
returned error: No such image:
mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
ERRO[3054] HTTP Error
 err=No such image:
mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
statusCode=404
INFO[3054] GET /v1.15/containers/weave/json
INFO[3054] POST
/v1.21/containers/create?name=mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
INFO[3054] POST
/v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/attach?stderr=1=1=1
INFO[3054] POST
/v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/start
INFO[3054] GET
/v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3054] GET
/v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3054] GET /v1.15/containers/weave/json
INFO[3054] GET
/v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3054] GET
/v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3054] GET /v1.15/containers/weave/json
INFO[3054] GET
/v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3054] GET
/v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
INFO[3111] GET /v1.21/containers/json
INFO[3120] GET /v1.21/containers/cf7/json
INFO[3120] GET
/v1.21/containers/cf7/logs?stderr=1=1=all
INFO[3153] GET /containers/json
INFO[3153] GET
/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3153] GET
/containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json
INFO[3153] GET
/containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
INFO[3175] GET /containers/json
INFO[3175] GET
/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3175] GET
/containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json
INFO[3175] GET
/containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
*INFO[3175] POST
/v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/stop*
?t=15
*ERRO[3175] attach: stdout: write unix @: broken pipe*
*INFO[3190] Container
cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47 failed to
exit within 15 seconds of SIGTERM - using the force *
*INFO[3200] Container cf7fc7c48324 failed to exit within 10
seconds of kill - trying direct SIGKILL *

*STDOUT from Mesos:*

*--container="mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b"
*--docker="/usr/local/ecxmcc/weaveShim" --help="false"
--initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
--mapped_directory="/mnt/mesos/sandbox" --quiet="false"
--sandbox_directory="/tmp/mesos/slaves/20160114-153418-1674208327-5050-3798-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285/runs/c5c35d59-1318-4a96-b850-b0b788815f1b"
--stop_timeout="15secs"
--container="mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b"
--docker="/usr/local/ecxmcc/weaveShim" --help="false"
--initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
--mapped_directory="/mnt/mesos/sandbox" --quiet="false"
--sandbox_dire

Re: Help needed (alas, urgently)

2016-01-14 Thread Paul Bell
Hey Tim,

Thank you very much for your reply.

Yes, I am in the midst of trying to reproduce the problem. If successful
(so to speak), I will do as you ask.

Cordially,

Paul

On Thu, Jan 14, 2016 at 3:19 PM, Tim Chen <t...@mesosphere.io> wrote:

> Hi Paul,
>
> Looks like we've already issued the docker stop as you seen in the ps
> output, but the containers are still running. Can you look at the Docker
> daemon logs and see what's going on there?
>
> And also can you also try to modify docker_stop_timeout to 0 so that we
> SIGKILL the containers right away, and see if this still happens?
>
> Tim
>
>
>
> On Thu, Jan 14, 2016 at 11:52 AM, Paul Bell <arach...@gmail.com> wrote:
>
>> Hi All,
>>
>> It's been quite some time since I've posted here and that's chiefly
>> because up until a day or two ago, things were working really well.
>>
>> I actually may have posted about this some time back. But then the
>> problem seemed more intermittent.
>>
>> In summa, several "docker stops" don't work, i.e., the containers are not
>> stopped.
>>
>> Deployment:
>>
>> one Ubuntu VM (vmWare) LTS 14.04 with kernel 3.19
>> Zookeeper
>> Mesos-master (0.23.0)
>> Mesos-slave (0.23.0)
>> Marathon (0.10.0)
>> Docker 1.9.1
>> Weave 1.1.0
>> Our application contains which include
>> MongoDB (4)
>> PostGres
>> ECX (our product)
>>
>> The only thing that's changed at all in the config above is the version
>> of Docker. Used to be 1.6.2 but I today upgraded it hoping to solve the
>> problem.
>>
>>
>> My automater program stops the application by sending Marathon an "http
>> delete" for each running up. Every now & then (reliably reproducible today)
>> not all containers get stopped. Most recently, 3 containers failed to stop.
>>
>> Here are the attendant phenomena:
>>
>> Marathon shows the 3 applications in deployment mode (presumably
>> "deployment" in the sense of "stopping")
>>
>> *ps output:*
>>
>> root@71:~# ps -ef | grep docker
>> root  3823 1  0 13:55 ?00:00:02 /usr/bin/docker daemon -H
>> unix:///var/run/docker.sock -H tcp://0.0.0.0:4243
>> root  4967 1  0 13:57 ?00:00:01 /usr/sbin/mesos-slave
>> --master=zk://71.100.202.99:2181/mesos --log_dir=/var/log/mesos
>> --containerizers=docker,mesos --docker=/usr/local/ecxmcc/weaveShim
>> --docker_stop_timeout=15secs --executor_registration_timeout=5mins
>> --hostname=71.100.202.99 --ip=71.100.202.99
>> --attributes=hostType:ecx,shard1 --resources=ports:[31000-31999,8443-8443]
>> root  5263  3823  0 13:57 ?00:00:00 docker-proxy -proto tcp
>> -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port
>> 6783
>> root  5271  3823  0 13:57 ?00:00:00 docker-proxy -proto udp
>> -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port
>> 6783
>> root  5279  3823  0 13:57 ?00:00:00 docker-proxy -proto tcp
>> -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port
>> 53
>> root  5287  3823  0 13:57 ?00:00:00 docker-proxy -proto udp
>> -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port
>> 53
>> root  7119  4967  0 14:00 ?00:00:01 mesos-docker-executor
>> --container=mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2
>> --docker=/usr/local/ecxmcc/weaveShim --help=false
>> --mapped_directory=/mnt/mesos/sandbox
>> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9/runs/bfc5a419-30f8-43f7-af2f-5582394532f2
>> --stop_timeout=15secs
>> root  7378  4967  0 14:00 ?00:00:01 mesos-docker-executor
>> --container=mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89
>> --docker=/usr/local/ecxmcc/weaveShim --help=false
>> --mapped_directory=/mnt/mesos/sandbox
>> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9/runs/9b700cdc-3d29-49b7-a7fc-e543a91f7b89
>> --stop_timeout=15secs
>> root  7640  4967  0 14:01 ?00:00:01 mesos-docker-executor
>> --container=mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298
>> --docker=/usr/local/ecxmcc/weaveShim --help=false
>> --mapped_directory=/mnt/m

Re: Help needed (alas, urgently)

2016-01-14 Thread Paul Bell
I spoke too soon, I'm afraid.

The next time I did the stop (with zero timeout), I saw the same phenomenon: a
mongo container repeatedly showing:

killing docker task
shutting down


What else can I try?

Thank you.

On Thu, Jan 14, 2016 at 5:07 PM, Paul Bell <arach...@gmail.com> wrote:

> Hi Tim,
>
> I set docker_stop_timeout to zero as you asked. I am pleased to report
> (though a bit fearful about being pleased) that this change seems to have
> shut everyone down pretty much instantly.
>
> Can you explain what's happening, e.g., does docker_stop_timeout=0 cause
> the immediate use of "kill -9" as opposed to "kill -2"?
>
> I will keep testing the behavior.
>
> Thank you.
>
> -Paul
>
> On Thu, Jan 14, 2016 at 3:59 PM, Paul Bell <arach...@gmail.com> wrote:
>
>> Hi Tim,
>>
>> Things have gotten slightly odder (if that's possible). When I now start
>> the application 5 or so containers, only one "ecxconfigdb" gets started -
>> and even he took a few tries. That is, I see him failing, moving to
>> deploying, then starting again. But I've no evidence (no STDOUT, and no
>> docker ctr logs) that show why.
>>
>> In any event, ecxconfigdb does start. Happily, when I try to stop the
>> application I am seeing the phenomena I posted before: killing docker task,
>> shutting down repeated many times. The UN-stopped container is now running
>> at 100% CPU.
>>
>> I will try modifying docker_stop_timeout. Back shortly
>>
>> Thanks again.
>>
>> -Paul
>>
>> PS: what do you make of the "broken pipe" error in the docker.log?
>>
>> *from /var/log/upstart/docker.log*
>>
>> INFO[3054] GET /v1.15/images/mongo:2.6.8/json
>> INFO[3054] GET
>> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>> ERRO[3054] Handler for GET
>> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>> returned error: No such image:
>> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>> ERRO[3054] HTTP Error err=No such image:
>> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>> statusCode=404
>> INFO[3054] GET /v1.15/containers/weave/json
>> INFO[3054] POST
>> /v1.21/containers/create?name=mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>> INFO[3054] POST
>> /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/attach?stderr=1=1=1
>> INFO[3054] POST
>> /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/start
>> INFO[3054] GET
>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> INFO[3054] GET
>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> INFO[3054] GET /v1.15/containers/weave/json
>> INFO[3054] GET
>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> INFO[3054] GET
>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> INFO[3054] GET /v1.15/containers/weave/json
>> INFO[3054] GET
>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> INFO[3054] GET
>> /v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>> INFO[3111] GET /v1.21/containers/json
>> INFO[3120] GET /v1.21/containers/cf7/json
>> INFO[3120] GET
>> /v1.21/containers/cf7/logs?stderr=1=1=all
>> INFO[3153] GET /containers/json
>> INFO[3153] GET
>> /containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> INFO[3153] GET
>> /containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json
>> INFO[3153] GET
>> /containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
>> INFO[3175] GET /containers/json
>> INFO[3175] GET
>> /containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> INFO[3175] GET
>> /containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json
>> INFO[3175] GET
>> /containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
>> * [34mI