Re: orphaned thermos

2017-10-30 Thread Mohit Jaggi
Great. It would be awesome if this could be included in 0.18.1.

On Mon, Oct 30, 2017 at 3:18 PM, Erb, Stephan wrote:

> This problem reminds me of a patch I have added for the observer to commit
> suicide on unhandled errors in threads. See
> https://github.com/apache/aurora/commit/d56f8c64466a94a990db3308a3130d3fce0584af
>
>
>
> I will prepare a similar patch for the executor.
>
>
>
> *From: *Bill Farner 
> *Reply-To: *"user@aurora.apache.org" 
> *Date: *Friday, 27. October 2017 at 05:34
> *To: *"user@aurora.apache.org" 
> *Subject: *Re: orphaned thermos
>
>
>
> If the executor runs out of memory, I think it should be assumed that it
> will no longer be well-behaved. It seems most sensible for the Mesos agent
> to clean up in this case.
>
>
>
> On Thu, Oct 26, 2017 at 11:56 AM, Mohit Jaggi 
> wrote:
>
> We found several zombie executors on a cluster. Thermos logs indicate that
> it hit system limits while trying to shut down(?). The Mesos agent is unable
> to get the status of this container from the Docker daemon (docker inspect
> fails). Shouldn't Thermos exit in such a case?

Re: orphaned thermos

2017-10-30 Thread Erb, Stephan
This problem reminds me of a patch I have added for the observer to commit 
suicide on unhandled errors in threads. See 
https://github.com/apache/aurora/commit/d56f8c64466a94a990db3308a3130d3fce0584af

I will prepare a similar patch for the executor.
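
To illustrate the general idea of exiting on unhandled errors in threads, here is a minimal sketch. It is not the linked patch: `FailFastThread`, `_die`, and the `on_fatal` parameter are hypothetical names chosen for this example, and the real executor code may be structured quite differently.

```python
import os
import sys
import threading
import traceback


def _die(exit_code=1):
    """Terminate the whole process immediately, skipping cleanup handlers."""
    os._exit(exit_code)


class FailFastThread(threading.Thread):
    """Hypothetical wrapper: an unhandled error in the thread ends the process."""

    def __init__(self, target, on_fatal=_die, **kwargs):
        super(FailFastThread, self).__init__(**kwargs)
        self._work = target
        # on_fatal is injectable so the behaviour can be tested without
        # actually killing the interpreter.
        self._on_fatal = on_fatal

    def run(self):
        try:
            self._work()
        except Exception:
            # Log the error first, then give up on the whole process rather
            # than leaving it running in a half-broken state.
            traceback.print_exc(file=sys.stderr)
            self._on_fatal()
```

With the default `on_fatal`, a crash in any such thread takes the process down, so a supervisor can notice and clean up instead of being left with a zombie.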

From: Bill Farner 
Reply-To: "user@aurora.apache.org" 
Date: Friday, 27. October 2017 at 05:34
To: "user@aurora.apache.org" 
Subject: Re: orphaned thermos

If the executor runs out of memory, I think it should be assumed that it will
no longer be well-behaved. It seems most sensible for the Mesos agent to clean
up in this case.

On Thu, Oct 26, 2017 at 11:56 AM, Mohit Jaggi wrote:
We found several zombie executors on a cluster. Thermos logs indicate that it
hit system limits while trying to shut down(?). The Mesos agent is unable to
get the status of this container from the Docker daemon (docker inspect fails).
Shouldn't Thermos exit in such a case?






Re: orphaned thermos

2017-10-26 Thread Bill Farner
If the executor runs out of memory, I think it should be assumed that it
will no longer be well-behaved. It seems most sensible for the Mesos agent
to clean up in this case.

On Thu, Oct 26, 2017 at 11:56 AM, Mohit Jaggi  wrote:

> We found several zombie executors on a cluster. Thermos logs indicate that
> it hit system limits while trying to shut down(?). The Mesos agent is unable
> to get the status of this container from the Docker daemon (docker inspect
> fails). Shouldn't Thermos exit in such a case?


orphaned thermos

2017-10-26 Thread Mohit Jaggi
We found several zombie executors on a cluster. Thermos logs indicate that it
hit system limits while trying to shut down(?). The Mesos agent is unable to
get the status of this container from the Docker daemon (docker inspect fails).
Shouldn't Thermos exit in such a case?


WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
Writing log files to disk in /mnt/mesos/sandbox
I1023 19:04:32.261165 7 exec.cpp:162] Version: 1.2.0
I1023 19:04:32.264870 42 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295
Writing log files to disk in /mnt/mesos/sandbox
Traceback (most recent call last):
  File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
    self.__real_run(*args, **kw)
  File "apache/thermos/monitoring/resource.py", line 243, in run
  File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/event_muxer.py", line 79, in wait
    thread.start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
ERROR] *Failed to stop health checkers:*
ERROR] Traceback (most recent call last):
  File "apache/aurora/executor/aurora_executor.py", line 209, in _shutdown
    propagate_deadline(self._chained_checker.stop, timeout=self.STOP_TIMEOUT)
  File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
    return deadline(*args, daemon=True, propagate=True, **kw)
  File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
    AnonymousThread().start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
*error: can't start new thread*

ERROR] *Failed to stop runner:*
ERROR] Traceback (most recent call last):
  File "apache/aurora/executor/aurora_executor.py", line 217, in _shutdown
    propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
  File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
    return deadline(*args, daemon=True, propagate=True, **kw)
  File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
    AnonymousThread().start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
*error: can't start new thread*

Traceback (most recent call last):
  File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
    self.__real_run(*args, **kw)
  File "apache/aurora/executor/status_manager.py", line 62, in run
  File "apache/aurora/executor/aurora_executor.py", line 235, in _shutdown
  File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deferred.py", line 56, in defer
    deferred.start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
*thread.error: can't start new thread*
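
Every shutdown step in the log above fails the same way: the cleanup path itself needs a new thread and cannot get one. One possible way to avoid lingering in that state is to treat a failed thread start as fatal. The sketch below assumes Python 3, where this failure surfaces as `RuntimeError` (the Python 2.7 log shows `thread.error`). `start_or_exit` and `on_fatal` are hypothetical names for illustration and are not part of Thermos or Aurora.

```python
import os
import threading


def start_or_exit(thread, on_fatal=lambda: os._exit(1)):
    """Start `thread`; if it cannot be started, give up on the process.

    Returns True when the thread started. Returns False after calling
    on_fatal, which is only observable when on_fatal itself returns
    (for example in tests).
    """
    try:
        thread.start()
    except RuntimeError:
        # Python 3 raises RuntimeError("can't start new thread") when the
        # thread limit is hit. Note that starting the same Thread object
        # twice also raises RuntimeError, so this sketch treats that as
        # fatal too.
        on_fatal()
        return False
    return True
```

The default `on_fatal` uses `os._exit` because a normal exit may try to run cleanup code that itself needs threads.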