This problem reminds me of a patch I added that makes the observer terminate 
itself on unhandled errors in threads. See 
https://github.com/apache/aurora/commit/d56f8c64466a94a990db3308a3130d3fce0584af

I will prepare a similar patch for the executor.
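
For illustration only, the general idea (end the whole process when a thread dies from an unhandled error) could look roughly like the sketch below. This is not the code from the linked commit. `FailFastThread` and `on_fatal` are hypothetical names, and the exit code 1 is an arbitrary choice.

```python
import os
import sys
import threading
import traceback


class FailFastThread(threading.Thread):
    """Thread that ends the whole process if its target raises.

    Hypothetical helper for illustration; not part of Aurora or Thermos.
    """

    def __init__(self, target, on_fatal=None, **kwargs):
        super(FailFastThread, self).__init__(**kwargs)
        self._work = target
        # sys.exit() in a non-main thread only ends that thread, so the
        # default uses os._exit to end the process itself.
        self._on_fatal = on_fatal if on_fatal is not None else os._exit

    def run(self):
        try:
            self._work()
        except Exception:
            # Log the error before exiting so the cause is not lost.
            traceback.print_exc(file=sys.stderr)
            sys.stderr.flush()
            self._on_fatal(1)
```

In a real daemon `on_fatal` would be left at its default, so the process exits and whatever supervises it (here, the Mesos agent) can clean up. The parameter exists mainly so the behaviour can be exercised without killing the caller.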

From: Bill Farner <wfar...@apache.org>
Reply-To: "user@aurora.apache.org" <user@aurora.apache.org>
Date: Friday, 27. October 2017 at 05:34
To: "user@aurora.apache.org" <user@aurora.apache.org>
Subject: Re: orphaned thermos

If the executor runs out of memory, I think it should be assumed that it will 
no longer be well-behaved. It seems most sensible for the Mesos agent to clean 
up in this case.

On Thu, Oct 26, 2017 at 11:56 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
We found several zombie executors on a cluster. The Thermos logs indicate that 
it hit system limits while trying to shut down(?). The Mesos agent is unable to 
get the status of this container from the Docker daemon (docker inspect fails). 
Shouldn't Thermos exit in such a case?



WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
Writing log files to disk in /mnt/mesos/sandbox
I1023 19:04:32.261165     7 exec.cpp:162] Version: 1.2.0
I1023 19:04:32.264870    42 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295
Writing log files to disk in /mnt/mesos/sandbox
Traceback (most recent call last):
  File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
    self.__real_run(*args, **kw)
  File "apache/thermos/monitoring/resource.py", line 243, in run
  File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/event_muxer.py", line 79, in wait
    thread.start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
ERROR] Failed to stop health checkers:
ERROR] Traceback (most recent call last):
  File "apache/aurora/executor/aurora_executor.py", line 209, in _shutdown
    propagate_deadline(self._chained_checker.stop, timeout=self.STOP_TIMEOUT)
  File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
    return deadline(*args, daemon=True, propagate=True, **kw)
  File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
    AnonymousThread().start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
error: can't start new thread

ERROR] Failed to stop runner:
ERROR] Traceback (most recent call last):
  File "apache/aurora/executor/aurora_executor.py", line 217, in _shutdown
    propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
  File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
    return deadline(*args, daemon=True, propagate=True, **kw)
  File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
    AnonymousThread().start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
error: can't start new thread

Traceback (most recent call last):
  File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
    self.__real_run(*args, **kw)
  File "apache/aurora/executor/status_manager.py", line 62, in run
  File "apache/aurora/executor/aurora_executor.py", line 235, in _shutdown
  File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deferred.py", line 56, in defer
    deferred.start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
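
In the log, every shutdown step fails because it needs a new thread and the process can no longer create one. One way an executor could avoid lingering in that state is to treat a failed thread start as fatal. The following is a minimal sketch under that assumption. `start_or_die` and `on_fatal` are hypothetical names, not Aurora or Thermos APIs. `threading.ThreadError` is `thread.error` on Python 2.7 (as in the log) and an alias of `RuntimeError` on Python 3.

```python
import os
import threading


def start_or_die(thread, on_fatal=None):
    """Start `thread`; if it cannot be started, end the process.

    Hypothetical helper for illustration only.
    """
    on_fatal = on_fatal if on_fatal is not None else os._exit
    try:
        thread.start()
    except threading.ThreadError:
        # "can't start new thread": a resource limit has been reached, so a
        # thread-based graceful shutdown is unlikely to succeed either.
        on_fatal(1)
        return False
    return True
```

With the default `on_fatal`, the process exits immediately, and cleanup is left to the supervising agent, in line with the suggestion earlier in the thread.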
