From the slave (now known as agent) logs:

I0114 14:09:51.297840 23049 slave.cpp:3967] Sending reconnect request to executor thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3 of framework 20150930-134812-84017418-5050-29407-0001 at executor(1)@NET.10:57730
I0114 14:09:53.298254 23050 slave.cpp:2638] Killing un-reregistered executor 'thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3' of framework 20150930-134812-84017418-5050-29407-0001

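As a quick sanity check on the timing, the gap between the two log lines above can be computed with a few lines of Python. This is only a sketch: `glog_time` is a hypothetical helper name, the year is an assumption (glog prefixes omit it, so 2016 is taken from the date of this thread), and the log messages are shortened to their prefixes.

```python
import re
from datetime import datetime

# glog prefix: severity letter, MMDD, then HH:MM:SS.microseconds
GLOG_PREFIX = re.compile(r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6}) ")


def glog_time(line, year=2016):
    """Parse the timestamp of a glog-style line (year is an assumption)."""
    match = GLOG_PREFIX.match(line)
    if match is None:
        raise ValueError("not a glog-formatted line: %r" % line)
    mmdd, clock = match.groups()
    return datetime.strptime(f"{year}{mmdd} {clock}", "%Y%m%d %H:%M:%S.%f")


# Prefixes of the two log lines quoted above (messages shortened).
reconnect = "I0114 14:09:51.297840 23049 slave.cpp:3967] Sending reconnect request"
kill = "I0114 14:09:53.298254 23050 slave.cpp:2638] Killing un-reregistered executor"

gap = (glog_time(kill) - glog_time(reconnect)).total_seconds()
print(f"{gap:.6f} seconds between reconnect request and kill")
```

The result is just over two seconds, which matches the reconnect timeout.
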
This means we didn't receive a response from the executor within the 2 second reconnect timeout, so it was killed.

Have you checked the kernel logs during this time frame? Were you changing Mesos versions around this event? Have you checked Thermos' logs?

On Thu, Jan 14, 2016 at 8:10 AM, Matthias Bach <[email protected]> wrote:

> Hi all,
>
> We are using Mesos 0.23.1 in combination with Aurora 0.10.0. So far we
> have been using the JSON format for Mesos' credential files. However,
> because of MESOS-3695 we decided to switch to the plain text format
> before updating to 0.24.1. Our understanding is that this should be a
> NOOP. However, on our cluster this caused multiple tasks to fail on each
> slave.
>
> I have attached two excerpts from the Mesos slave log. One where I
> grepped for the executor ID of one of the failed tasks, and one where I
> grepped for the ID of the corresponding container. What you can see is
> that recovery of the container is started and, immediately afterwards,
> the executor is killed.
>
> Our change procedure was:
> * Place the new plain-text credential file
> * Restart the slave with `--credential` pointing to the new file
> * Remove the old JSON credential file
>
> We are running the Mesos slave using supervisord and use the following
> isolators: cgroups/cpu, cgroups/mem, filesystem/shared, namespaces/pid,
> and posix/disk. In addition we use `--enforce_container_disk_quota`.
> Regarding recovery we use the options `--recover="reconnect"` and
> `--strict="false"`.
>
> The Thermos log does not provide any hints as to what happened. It looks
> like Thermos was SIGKILLed.
>
> Has any of you run into this problem before? Do you have an idea what
> could cause this behaviour? Do you have any suggestion what information
> we could look for to better understand what happens?
>
> Kind Regards,
> Matthias
>
> --
> Dr. Matthias Bach
> Senior Software Engineer
> *Blue Yonder GmbH*
> Ohiostraße 8
> D-76149 Karlsruhe
>
> Tel +49 (0)721 383 117 6244
> Fax +49 (0)721 383 117 69
>
> [email protected]
> www.blue-yonder.com
> Registergericht Mannheim, HRB 704547
> USt-IdNr. DE DE 277 091 535
> Geschäftsführer: Jochen Bossert, Uwe Weiss (CEO)

