Hi,

Thanks for your feedback. I checked the kernel logs and they don't show anything. I had already checked the Thermos logs; they show no activity, as if the request never reached Thermos.
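For readers following along, the kind of kernel-log check discussed here can be sketched as follows (the time window and grep patterns are illustrative, not from the original report):

```shell
# Look in the kernel ring buffer for evidence of an OOM kill or other
# kernel-initiated SIGKILL (human-readable timestamps via -T):
dmesg -T | grep -iE 'killed process|out of memory|oom' || true

# On systemd hosts, the same window can be queried explicitly:
journalctl -k --since '2016-01-14 14:05' --until '2016-01-14 14:15' 2>/dev/null || true
```

An empty result here is consistent with the executor having been killed by the agent itself rather than by the kernel.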
Kind Regards,
Matthias

On 20.01.2016 at 03:14, Benjamin Mahler wrote:
> From the slave (now known as agent) logs:
>
> I0114 14:09:51.297840 23049 slave.cpp:3967] Sending reconnect request
> to executor
> thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3
> of framework 20150930-134812-84017418-5050-29407-0001 at
> executor(1)@NET.10:57730
> I0114 14:09:53.298254 23050 slave.cpp:2638] Killing un-reregistered
> executor
> 'thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3'
> of framework 20150930-134812-84017418-5050-29407-0001
>
> This means we didn't receive a response from the executor within the
> 2-second reconnect timeout, so it was killed. Have you checked the
> kernel logs during this time frame? Were you changing Mesos versions
> around this event? Have you checked Thermos' logs?
>
> On Thu, Jan 14, 2016 at 8:10 AM, Matthias Bach
> <[email protected]> wrote:
>
> Hi all,
>
> We are using Mesos 0.23.1 in combination with Aurora 0.10.0. So far we
> have been using the JSON format for Mesos' credential files. However,
> because of MESOS-3695 we decided to switch to the plain text format
> before updating to 0.24.1. Our understanding is that this should be a
> no-op. However, on our cluster this caused multiple tasks to fail on
> each slave.
>
> I have attached two excerpts from the Mesos slave log: one where I
> grepped for the executor ID of one of the failed tasks, and one where I
> grepped for the ID of the corresponding container. What you can see is
> that recovery of the container is started and – immediately afterwards
> – the executor is killed.
>
> Our change procedure was:
> * Place the new plain-text credential file
> * Restart the slave with `--credential` pointing to the new file
> * Remove the old JSON credential file
>
> We are running the Mesos slave using supervisord and use the following
> isolators: cgroups/cpu, cgroups/mem, filesystem/shared, namespaces/pid,
> and posix/disk. In addition we use `--enforce_container_disk_quota`.
> Regarding recovery we use the options `--recover="reconnect"` and
> `--strict="false"`.
>
> The Thermos log does not provide any hints as to what happened. It
> looks like Thermos was SIGKILLed.
>
> Has any of you run into this problem before? Do you have an idea what
> could cause this behaviour? Do you have any suggestion what information
> we could look for to better understand what happens?
>
> Kind Regards,
> Matthias
>
> --
> Dr. Matthias Bach
> Senior Software Engineer
> *Blue Yonder GmbH*
> Ohiostraße 8
> D-76149 Karlsruhe
>
> Tel +49 (0)721 383 117 6244
> Fax +49 (0)721 383 117 69
>
> [email protected]
> www.blue-yonder.com
> Registergericht Mannheim, HRB 704547
> USt-IdNr. DE 277 091 535
> Geschäftsführer: Jochen Bossert, Uwe Weiss (CEO)
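For context on the change being discussed, here is a sketch of the two credential-file formats the thread compares, as I understand them from the Mesos configuration documentation. Filenames, principal, and secret are placeholders; the agent would be pointed at one of these via `--credential=file:///...`:

```shell
# JSON format (a single credential object), the format used before the change:
cat > credential.json <<'EOF'
{
  "principal": "aurora",
  "secret": "example-secret"
}
EOF

# Plain-text format, the format switched to: principal and secret on one
# line, separated by whitespace.
printf 'aurora example-secret\n' > credential.txt
```

In principle the two files carry identical information, which is why the switch was expected to be a no-op.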

