Re: orphaned thermos
Great. It would be awesome if this could be included in 0.18.1.

On Mon, Oct 30, 2017 at 3:18 PM, Erb, Stephan wrote:
> This problem reminds me of a patch I have added for the observer to commit suicide on unhandled errors in threads. See https://github.com/apache/aurora/commit/d56f8c64466a94a990db3308a3130d3fce0584af
>
> I will prepare a similar patch for the executor.
Re: orphaned thermos
This problem reminds me of a patch I added to the observer that makes it commit suicide on unhandled errors in threads. See https://github.com/apache/aurora/commit/d56f8c64466a94a990db3308a3130d3fce0584af

I will prepare a similar patch for the executor.

From: Bill Farner
Reply-To: "user@aurora.apache.org"
Date: Friday, 27. October 2017 at 05:34
To: "user@aurora.apache.org"
Subject: Re: orphaned thermos

If the executor runs out of memory, I think it should be assumed that it will no longer be well-behaved. It seems most sensible for the Mesos agent to clean up in this case.
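The idea of terminating the whole process when a thread dies from an unhandled error can be sketched roughly as below. This is a minimal Python 3 illustration, not the code from the linked commit: the class name `ExitOnErrorThread` and the `exit_fn` and `exit_code` parameters are made up for this example. It uses `os._exit` because `sys.exit` called from a non-main thread only ends that thread, and because `os._exit` does not need to start any new thread.

```python
import os
import sys
import threading
import traceback


class ExitOnErrorThread(threading.Thread):
    """Thread that terminates the whole process if its run() raises.

    Hypothetical helper for illustration only; the names are not taken
    from the Aurora code base.
    """

    def __init__(self, *args, exit_fn=os._exit, exit_code=1, **kwargs):
        super().__init__(*args, **kwargs)
        # exit_fn is injectable so the behaviour can be tested without
        # actually killing the interpreter.
        self._exit_fn = exit_fn
        self._exit_code = exit_code

    def run(self):
        try:
            super().run()
        except Exception:
            # Log the failure first, then leave immediately. os._exit skips
            # cleanup handlers, which is intended here: the process is
            # assumed to be in a state it cannot recover from.
            traceback.print_exc(file=sys.stderr)
            sys.stderr.flush()
            self._exit_fn(self._exit_code)


if __name__ == "__main__":
    def broken_worker():
        raise RuntimeError("can't start new thread")

    # With the default exit_fn this ends the process with status 1.
    ExitOnErrorThread(target=broken_worker).start()
```

Whether exiting is the right reaction depends on the supervisor: it only helps if something (here, presumably the Mesos agent) notices the exit and cleans up the container.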
Re: orphaned thermos
If the executor runs out of memory, I think it should be assumed that it will no longer be well-behaved. It seems most sensible for the Mesos agent to clean up in this case.

On Thu, Oct 26, 2017 at 11:56 AM, Mohit Jaggi wrote:
> We found several zombie executors on a cluster. Thermos logs indicate reaching system limits while trying to shutdown(?). Mesos agent is unable to get status of this container from docker daemon (docker inspect fails). Shouldn't thermos exit in such a case?
orphaned thermos
We found several zombie executors on a cluster. The Thermos logs indicate that system limits were reached while trying to shut down(?). The Mesos agent is unable to get the status of this container from the Docker daemon (docker inspect fails). Shouldn't Thermos exit in such a case?

WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
Writing log files to disk in /mnt/mesos/sandbox
I1023 19:04:32.261165 7 exec.cpp:162] Version: 1.2.0
I1023 19:04:32.264870 42 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295
Writing log files to disk in /mnt/mesos/sandbox
Traceback (most recent call last):
  File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
    self.__real_run(*args, **kw)
  File "apache/thermos/monitoring/resource.py", line 243, in run
  File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/event_muxer.py", line 79, in wait
    thread.start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
ERROR] Failed to stop health checkers:
ERROR] Traceback (most recent call last):
  File "apache/aurora/executor/aurora_executor.py", line 209, in _shutdown
    propagate_deadline(self._chained_checker.stop, timeout=self.STOP_TIMEOUT)
  File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
    return deadline(*args, daemon=True, propagate=True, **kw)
  File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
    AnonymousThread().start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
error: can't start new thread

ERROR] Failed to stop runner:
ERROR] Traceback (most recent call last):
  File "apache/aurora/executor/aurora_executor.py", line 217, in _shutdown
    propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
  File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
    return deadline(*args, daemon=True, propagate=True, **kw)
  File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
    AnonymousThread().start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
error: can't start new thread

Traceback (most recent call last):
  File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
    self.__real_run(*args, **kw)
  File "apache/aurora/executor/status_manager.py", line 62, in run
  File "apache/aurora/executor/aurora_executor.py", line 235, in _shutdown
  File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deferred.py", line 56, in defer
    deferred.start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
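The tracebacks suggest why the executor lingers: every shutdown step (`deadline`, `defer`) first tries to start a helper thread, and that start is exactly what fails. A minimal Python 3 sketch of a shutdown helper that falls back to a hard exit in that situation might look like the following. The names `stop_with_timeout`, `hard_exit` and `thread_cls` are hypothetical and not part of Aurora or twitter.common; note that Python 3 raises `RuntimeError` where the Python 2.7 log above shows `thread.error`.

```python
import os
import threading


def stop_with_timeout(stop_fn, timeout, hard_exit=os._exit,
                      thread_cls=threading.Thread):
    """Run stop_fn in a helper thread and wait up to `timeout` seconds.

    Returns True if stop_fn finished in time and False if it timed out.
    If the helper thread cannot even be started, hard_exit(1) is called
    instead of leaving the process half shut down.

    Illustrative sketch only; hard_exit and thread_cls are injectable so
    the fallback path can be exercised in a test.
    """
    worker = thread_cls(target=stop_fn)
    worker.daemon = True
    try:
        worker.start()
    except RuntimeError:
        # "can't start new thread": no graceful stop is possible any more,
        # so exit without relying on further threads.
        hard_exit(1)
        return False
    worker.join(timeout)
    return not worker.is_alive()


if __name__ == "__main__":
    finished = stop_with_timeout(lambda: None, timeout=1.0)
    print("stopped cleanly:", finished)
```

This only covers the case where the process itself can still run Python code; if it is stuck or out of memory, cleanup by the agent is still needed.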