Here are some relevant logs. The Aurora scheduler logs show the task going
from:
INIT
-> PENDING
-> ASSIGNED
-> STARTING
-> RUNNING (for a long time)
-> FAILED due to a health check error, OSError: Resource temporarily
unavailable (I believe this refers to running out of PID space; see the
thermos logs below, and the small reproduction sketch right after this list)
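
As a hedged aside, here is a minimal Python 2 sketch (mine, not taken from
the executor code) of how hitting the task's thread/PID limit (ulimit -u or
a pids cgroup) surfaces as the same "can't start new thread" error seen in
the thermos traceback further down:

    import threading
    import time

    def sleeper():
        time.sleep(3600)

    spawned = []
    try:
        # Keep creating daemon threads until pthread_create() fails with
        # EAGAIN, which Python 2 reports as thread.error: can't start new thread.
        while True:
            t = threading.Thread(target=sleeper)
            t.daemon = True
            t.start()
            spawned.append(t)
    except Exception as exc:
        print("thread creation failed after %d threads: %s" % (len(spawned), exc))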


--- mesos agent ---

I1005 22:56:47.902153 127818 fetcher.cpp:285] Fetching directly into
the sandbox directory
I1005 22:56:47.902170 127818 fetcher.cpp:222] Fetching URI '/usr/bin/XXXXX'
I1005 22:56:47.913270 127818 fetcher.cpp:207] Copied resource
'/usr/bin/xxxxx' to
'/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
I1005 22:56:47.913331 127818 fetcher.cpp:582] Fetched '/usr/bin/xxx'
to 
'/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
WARNING: Your kernel does not support swap limit capabilities, memory
limited without swap.
twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
Writing log files to disk in /mnt/mesos/sandbox
I1005 22:58:15.677225     7 exec.cpp:162] Version: 1.1.0
I1005 22:58:15.680867    14 exec.cpp:237] Executor registered on agent
b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540
Writing log files to disk in /mnt/mesos/sandbox
I1006 01:13:52.950552    39 exec.cpp:487] Agent exited, but framework
has checkpointing enabled. Waiting 365days to reconnect with agent
b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540



--- thermos (Aurora) ----

I1023 19:03:05.765677 52364 fetcher.cpp:582] Fetched '/usr/bin/xxx' to
'/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-d3c1c4d9-4d74-433a-b26a-8a88bb7687b8/runs/982e7236-fccd-40bc-a2a5-d8a1901cf0bf/fxxx'
WARNING: Your kernel does not support swap limit capabilities, memory
limited without swap.
twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
Writing log files to disk in /mnt/mesos/sandbox
I1023 19:04:32.261165     7 exec.cpp:162] Version: 1.2.0
I1023 19:04:32.264870    42 exec.cpp:237] Executor registered on agent
b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295
Writing log files to disk in /mnt/mesos/sandbox
Traceback (most recent call last):
  File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
    self.__real_run(*args, **kw)
  File "apache/thermos/monitoring/resource.py", line 243, in run
  File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/event_muxer.py", line 79, in wait
    thread.start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
ERROR] Failed to stop health checkers:
ERROR] Traceback (most recent call last):
  File "apache/aurora/executor/aurora_executor.py", line 209, in _shutdown
    propagate_deadline(self._chained_checker.stop, timeout=self.STOP_TIMEOUT)
  File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
    return deadline(*args, daemon=True, propagate=True, **kw)
  File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
    AnonymousThread().start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
error: can't start new thread

ERROR] Failed to stop runner:
ERROR] Traceback (most recent call last):
  File "apache/aurora/executor/aurora_executor.py", line 217, in _shutdown
    propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
  File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
    return deadline(*args, daemon=True, propagate=True, **kw)
  File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
    AnonymousThread().start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
error: can't start new thread

Traceback (most recent call last):
  File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
    self.__real_run(*args, **kw)
  File "apache/aurora/executor/status_manager.py", line 62, in run
  File "apache/aurora/executor/aurora_executor.py", line 235, in _shutdown
  File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deferred.py", line 56, in defer
    deferred.start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
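
For reference, a minimal sketch of how one could check from inside an
affected container whether the task is bumping into a PID/thread limit (the
cgroup mount path and the presence of a v1 pids controller are assumptions
on my part, not taken from these logs):

    import resource

    def read_value(path):
        # Returns "unavailable" if the cgroup file is absent
        # (e.g. the pids controller is not mounted at this path).
        try:
            with open(path) as f:
                return f.read().strip()
        except IOError:
            return "unavailable"

    # cgroup v1 pids controller; the mount point is an assumption.
    print("pids.current: %s" % read_value("/sys/fs/cgroup/pids/pids.current"))
    print("pids.max:     %s" % read_value("/sys/fs/cgroup/pids/pids.max"))

    # The per-user process/thread limit (ulimit -u) is another common
    # cause of EAGAIN when starting threads.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    print("RLIMIT_NPROC: soft=%s hard=%s" % (soft, hard))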




On Fri, Oct 27, 2017 at 2:25 PM, Vinod Kone <[email protected]> wrote:

> Can you share the agent and executor logs of an example orphaned executor?
> That would help us diagnose the issue.
>
> On Fri, Oct 27, 2017 at 8:19 PM, Mohit Jaggi <[email protected]> wrote:
>
>> Folks,
>> Often I see some orphaned executors in my cluster. These are cases where
>> the framework was informed of task loss, so has forgotten about them as
>> expected, but the container(docker) is still around. AFAIK, Mesos agent is
>> the only entity that has knowledge of these containers. How do I ensure
>> that they get cleaned up by the agent?
>>
>> Mohit.
>>
>
>
