Hi,

Thanks for your feedback. I checked the kernel logs and they don't show anything. I had already checked the Thermos logs; they show no activity, as if the request never reached Thermos.
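For readers following along, the kind of kernel-log check discussed here can be sketched as follows (the time window and grep patterns are illustrative, not from the original report):

```shell
# Look in the kernel ring buffer for evidence of an OOM kill or other
# kernel-initiated SIGKILL (human-readable timestamps via -T):
dmesg -T | grep -iE 'killed process|out of memory|oom' || true

# On systemd hosts, the same window can be queried explicitly:
journalctl -k --since '2016-01-14 14:05' --until '2016-01-14 14:15' 2>/dev/null || true
```

An empty result here is consistent with the executor having been killed by the agent itself rather than by the kernel.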
Kind Regards,
Matthias

On 20.01.2016 at 03:14, Benjamin Mahler wrote:
> From the slave (now known as agent) logs:
>
> I0114 14:09:51.297840 23049 slave.cpp:3967] Sending reconnect request
> to executor
> thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3
> of framework 20150930-134812-84017418-5050-29407-0001 at
> executor(1)@NET.10:57730
> I0114 14:09:53.298254 23050 slave.cpp:2638] Killing un-reregistered
> executor
> 'thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3'
> of framework 20150930-134812-84017418-5050-29407-0001
>
> This means we didn't receive a response from the executor within the
> 2-second reconnect timeout, so it was killed. Have you checked the
> kernel logs during this time frame? Were you changing Mesos versions
> around this event? Have you checked Thermos' logs?
>
> On Thu, Jan 14, 2016 at 8:10 AM, Matthias Bach
> <[email protected]> wrote:
>
> Hi all,
>
> We are using Mesos 0.23.1 in combination with Aurora 0.10.0. So far we
> have been using the JSON format for Mesos' credential files. However,
> because of MESOS-3695 we decided to switch to the plain text format
> before updating to 0.24.1. Our understanding is that this should be a
> no-op. However, on our cluster this caused multiple tasks to fail on
> each slave.
>
> I have attached two excerpts from the Mesos slave log: one where I
> grepped for the executor ID of one of the failed tasks, and one where I
> grepped for the ID of the corresponding container. What you can see is
> that recovery of the container is started and – immediately afterwards
> – the executor is killed.
>
> Our change procedure was:
> * Place the new plain-text credential file
> * Restart the slave with `--credential` pointing to the new file
> * Remove the old JSON credential file
>
> We are running the Mesos slave using supervisord and use the following
> isolators: cgroups/cpu, cgroups/mem, filesystem/shared, namespaces/pid,
> and posix/disk. In addition we use `--enforce_container_disk_quota`.
> Regarding recovery we use the options `--recover="reconnect"` and
> `--strict="false"`.
>
> The Thermos log does not provide any hints as to what happened. It
> looks like Thermos was SIGKILLed.
>
> Has any of you run into this problem before? Do you have an idea what
> could cause this behaviour? Do you have any suggestion what information
> we could look for to better understand what happens?
>
> Kind Regards,
> Matthias
>
> --
> Dr. Matthias Bach
> Senior Software Engineer
> *Blue Yonder GmbH*
> Ohiostraße 8
> D-76149 Karlsruhe
>
> Tel +49 (0)721 383 117 6244
> Fax +49 (0)721 383 117 69
>
> [email protected]
> www.blue-yonder.com
> Registergericht Mannheim, HRB 704547
> USt-IdNr. DE 277 091 535
> Geschäftsführer: Jochen Bossert, Uwe Weiss (CEO)
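For context on the change being discussed, here is a sketch of the two credential-file formats the thread compares, as I understand them from the Mesos configuration documentation. Filenames, principal, and secret are placeholders; the agent would be pointed at one of these via `--credential=file:///...`:

```shell
# JSON format (a single credential object), the format used before the change:
cat > credential.json <<'EOF'
{
  "principal": "aurora",
  "secret": "example-secret"
}
EOF

# Plain-text format, the format switched to: principal and secret on one
# line, separated by whitespace.
printf 'aurora example-secret\n' > credential.txt
```

In principle the two files carry identical information, which is why the switch was expected to be a no-op.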

