As the log shows, it failed when it ran the command below to find the
container's status.
```
docker -H unix:///var/run/docker.sock inspect mesos-498ff8de-782e-482a-9478-69d3faf5a853-S5.a242fc24-0d32-46e6-af63-299cb82fc01c
```

Have you mounted the Docker socket file from the host into your agent container?
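If not, a minimal sketch of what that usually looks like is below; the image
name and container name are only examples taken from this thread, and host
networking is based on your `docker events` output, so adjust to your setup:

```
# Sketch only: launch the agent container with the host's Docker socket
# mounted, so `docker inspect` run inside the container talks to the host
# daemon that actually started the task containers.
docker run -d --name agent \
  --net host \
  -v /var/run/docker.sock:/var/run/docker.sock \
  alisw/mesos-slave:1.0.1
```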

On Fri, Jan 13, 2017 at 8:20 PM, Giulio Eulisse <[email protected]>
wrote:

>
> Actually, no. The Docker containers seem to be running just fine; it looks
> like Mesos is not able to notice that. Did anything change in the way Mesos
> looks them up? Note that I've both renamed my container to "agent" and
> added MESOS_DOCKER_KILL_ORPHANS=false.
>
>
>
> On 13 Jan 2017, 02:14 +0100, haosdent <[email protected]>, wrote:
>
> Could it be that your riemann-elasticsearch container did not start
> successfully?
>
> On Fri, Jan 13, 2017 at 9:10 AM, Giulio Eulisse <[email protected]>
> wrote:
>
>> Mmm... it improved things, but now I get a bunch of:
>>
>> ```
>> W0113 01:06:24.757287 17811 slave.cpp:5220] Failed to get resource statistics for executor 'riemann-elasticsearch.7fc1bc0b-d92c-11e6-9367-02426821a225' of framework 20150626-112246-2475462272-5050-5-0000: Failed to run 'docker -H unix:///var/run/docker.sock inspect mesos-498ff8de-782e-482a-9478-69d3faf5a853-S5.a242fc24-0d32-46e6-af63-299cb82fc01c': exited with status 1; stderr='Error: No such image, container or task: mesos-498ff8de-782e-482a-9478-69d3faf5a853-S5.a242fc24-0d32-46e6-af63-299cb82fc01c
>> ```
>>
>> and then leaves out a bunch of running containers.
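>>
>> For what it's worth, I can compare by hand what the host daemon knows
>> about with what the agent is looking for (just a manual check, with the
>> container name copied from the log above):
>>
>> ```
>> # List the mesos-* containers the host daemon knows about, then ask for
>> # the state of the one the agent is complaining about.
>> docker -H unix:///var/run/docker.sock ps --filter name=mesos- --format '{{.Names}}'
>> docker -H unix:///var/run/docker.sock inspect --format '{{.State.Status}}' \
>>   mesos-498ff8de-782e-482a-9478-69d3faf5a853-S5.a242fc24-0d32-46e6-af63-299cb82fc01c
>> ```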
>>
>> On 13 Jan 2017, 01:51 +0100, Joseph Wu <[email protected]>, wrote:
>>
>> If Apache JIRA were up, I'd point you to a JIRA noting the problem with
>> naming docker containers `mesos-*`, as Mesos reserves that prefix (and
>> kills everything it considers "unknown").
>>
>> As a quick workaround, try setting this flag to false:
>> https://github.com/apache/mesos/blob/1.1.x/src/slave/flags.cpp#L590-L596
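>>
>> Roughly one of the following, depending on how you launch the agent; treat
>> it as a sketch rather than your exact command line:
>>
>> ```
>> # As an agent command-line flag:
>> mesos-agent --docker_kill_orphans=false ...
>> # Or as an environment variable, if the agent is configured via env:
>> export MESOS_DOCKER_KILL_ORPHANS=false
>> ```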
>>
>> On Thu, Jan 12, 2017 at 4:41 PM, Giulio Eulisse <[email protected]
>> > wrote:
>>
>>> Mmm... it seems to die after a long sequence of forks, and Mesos itself
>>> seems to be issuing the SIGKILL. I wonder if it's trying to do some cleanup
>>> and does not realise that one of the containers is the agent itself? Note
>>> that I do have `MESOS_DOCKER_MESOS_IMAGE=alisw/mesos-slave:1.0.1` set.
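>>>
>>> To double-check which docker-related flags the agent actually picked up
>>> from the environment, I can dump its /flags endpoint; 5051 is just the
>>> default agent port, which I'm assuming isn't overridden here:
>>>
>>> ```
>>> # Pretty-print the agent's resolved flags and pick out the docker ones.
>>> curl -s http://localhost:5051/flags | python -m json.tool | grep -i docker
>>> ```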
>>>
>>> On 13 Jan 2017, 01:23 +0100, Giulio Eulisse <[email protected]>,
>>> wrote:
>>>
>>> Ciao,
>>>
>>> the only thing I could find is by running a parallel `docker events`
>>>
>>> ```
>>> 2017-01-13T01:18:20.766593692+01:00 network connect 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71, name=host, type=host)
>>> 2017-01-13T01:18:20.846137793+01:00 container start 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, vendor=CentOS)
>>> 2017-01-13T01:18:20.847965921+01:00 container resize 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, height=16, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, vendor=CentOS, width=134)
>>> 2017-01-13T01:18:21.610141857+01:00 container kill 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, signal=15, vendor=CentOS)
>>> 2017-01-13T01:18:21.610491564+01:00 container kill 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, signal=9, vendor=CentOS)
>>> 2017-01-13T01:18:21.646229213+01:00 container die 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, exitCode=143, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, vendor=CentOS)
>>> 2017-01-13T01:18:21.652894124+01:00 network disconnect 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71, name=host, type=host)
>>> 2017-01-13T01:18:21.705874041+01:00 container stop 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, vendor=CentOS)
>>> ```
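>>>
>>> For reference, that stream comes from roughly the following; the
>>> `--filter` is only there to limit the output to the agent container:
>>>
>>> ```
>>> # Watch Docker events for the agent container only.
>>> docker events --filter container=mesos-slave
>>> ```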
>>>
>>> Ciao,
>>> Giulio
>>>
>>> On 13 Jan 2017, 01:06 +0100, haosdent <[email protected]>, wrote:
>>>
>>> Hi @Giulio, according to your log, it looks normal. Do you have any
>>> logs related to "SIGKILL"?
>>>
>>> On Fri, Jan 13, 2017 at 8:00 AM, Giulio Eulisse <
>>> [email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a setup where I run Mesos in Docker which works perfectly when I
>>>> use 0.28.2. I have now migrated to 1.0.1 (but it's the same with 1.1.0 and
>>>> 1.0.0) and the agent seems to receive a SIGKILL right after printing:
>>>>
>>>> WARNING: Logging before InitGoogleLogging() is written to STDERR
>>>> I0112 23:22:09.889120  4934 main.cpp:243] Build: 2016-08-26 23:06:27 by centos
>>>> I0112 23:22:09.889181  4934 main.cpp:244] Version: 1.0.1
>>>> I0112 23:22:09.889184  4934 main.cpp:247] Git tag: 1.0.1
>>>> I0112 23:22:09.889188  4934 main.cpp:251] Git SHA: 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3
>>>> W0112 23:22:09.890808  4934 openssl.cpp:398] Failed SSL connections will be downgraded to a non-SSL socket
>>>> W0112 23:22:09.891237  4934 process.cpp:881] Failed SSL connections will be downgraded to a non-SSL socket
>>>> E0112 23:22:10.129096  4934 shell.hpp:106] Command 'hadoop version 2>&1' failed; this is the output:
>>>> sh: hadoop: command not found
>>>> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
>>>> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@730: Client environment:host.name=XXXX.XXX.ch
>>>> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@737: Client environment:os.name=Linux
>>>> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@738: Client environment:os.arch=3.10.0-229.14.1.el7.x86_64
>>>> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@739: Client environment:os.version=#1 SMP Tue Sep 15 15:05:51 UTC 2015
>>>> 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@747: Client environment:user.name=(null)
>>>> 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@755: Client environment:user.home=/root
>>>> 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@767: Client environment:user.dir=/
>>>> 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@zookeeper_init@800: Initiating client connection, host=XXX1.YYY.ch:2181,XXX2.YYY.ch:2181,XXX3.YYY.ch:2181 sessionTimeout=10000 watcher=0x7f950ee20300 sessionId=0 sessionPasswd=<null> context=0x7f94f0000c60 flags=0
>>>> 2017-01-12 23:22:10,134:4934(0x7f9501fd7700):ZOO_INFO@check_events@1728: initiated connection to server [XX.YY.ZZ.WW:2181]
>>>> 2017-01-12 23:22:10,146:4934(0x7f9501fd7700):ZOO_INFO@check_events@1775: session establishment complete on server [XX.YY.ZZ.WW:2181], sessionId=0x35828ae70fb2065, negotiated timeout=10000
>>>>
>>>> Any idea what might be going on? It looks like an OOM kill, but I do not
>>>> see it in /var/log/messages and it also happens with --oom-kill-disable.
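>>>>
>>>> For completeness, this is roughly how I'd ask Docker itself whether it
>>>> recorded an OOM kill for the agent container; "mesos-slave" is just a
>>>> placeholder for whatever the agent container is called on that host:
>>>>
>>>> ```
>>>> # Did Docker record an OOM kill, and what was the exit code?
>>>> docker inspect --format 'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}' mesos-slave
>>>> ```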
>>>>
>>>> --
>>>> Ciao,
>>>> Giulio
>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Haosdent Huang
>>>
>>>
>>
>
>
> --
> Best Regards,
> Haosdent Huang
>
>


-- 
Best Regards,
Haosdent Huang
