Is my understanding correct that Thermos transitions the task to
TASK_FAILED, but then gets stuck and can't terminate itself? The typical
workflow for Thermos, as a 1:1 task:executor model, is that the executor
terminates itself once the task reaches a terminal state.
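
For reference, that pattern is roughly the sketch below (hypothetical names,
not the actual aurora_executor.py/Thermos code): the executor reports the
terminal status and then tears itself down, so if that last step never runs
the executor is left behind.

    import threading

    TERMINAL_STATES = frozenset(
        ['TASK_FINISHED', 'TASK_FAILED', 'TASK_KILLED', 'TASK_LOST'])

    class OneShotExecutor(object):
        """Toy 1:1 task:executor loop; 'driver' stands in for the real driver."""

        def __init__(self, driver):
            self._driver = driver
            self._terminal = threading.Event()

        def on_status(self, task_id, state):
            # Forward the status update, then remember if it was terminal.
            self._driver.send_status_update(task_id, state)  # hypothetical helper
            if state in TERMINAL_STATES:
                self._terminal.set()

        def run(self):
            # Wait for the single task to become terminal, then stop the executor.
            # If this step is never reached (e.g. a shutdown thread can't be
            # spawned), the executor lingers even though the task is FAILED.
            self._terminal.wait()
            self._driver.stop()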

The full agent logs from this window would help; it looks like an
agent termination is involved here as well?
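
On the PID-space theory in the quoted message below: "can't start new thread"
usually means pthread_create failed with EAGAIN, i.e. the container hit its
pids cgroup limit (or nproc rlimit). A quick check you could run on the agent,
assuming cgroup v1 with the pids controller mounted under /sys/fs/cgroup/pids
(the exact container cgroup path depends on your agent's layout):

    import os
    import sys

    def pids_headroom(cgroup_dir):
        # Hypothetical helper: compare the container's pid count to its limit.
        def read(name):
            with open(os.path.join(cgroup_dir, name)) as f:
                return f.read().strip()
        current = int(read('pids.current'))
        limit = read('pids.max')  # the literal string 'max' means unlimited
        return current, limit

    if __name__ == '__main__':
        current, limit = pids_headroom(sys.argv[1])
        print('pids.current=%s pids.max=%s' % (current, limit))

If pids.current sits at pids.max around the time the health check fails, that
would line up with the thread.error tracebacks in the Thermos log below.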

On Fri, Oct 27, 2017 at 3:09 PM, Mohit Jaggi <[email protected]> wrote:

> Here are some relevant logs. The Aurora scheduler logs show the task going
> from:
> INIT
> ->PENDING
> ->ASSIGNED
> ->STARTING
> ->RUNNING for a long time
> ->FAILED due to health check error, OSError: Resource temporarily
> unavailable (I think this is referring to running out of PID space, see
> thermos logs below)
>
>
> --- mesos agent ---
>
> I1005 22:56:47.902153 127818 fetcher.cpp:285] Fetching directly into the 
> sandbox directory
> I1005 22:56:47.902170 127818 fetcher.cpp:222] Fetching URI '/usr/bin/XXXXX'
> I1005 22:56:47.913270 127818 fetcher.cpp:207] Copied resource 
> '/usr/bin/xxxxx' to 
> '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
> I1005 22:56:47.913331 127818 fetcher.cpp:582] Fetched '/usr/bin/xxx' to 
> '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
> WARNING: Your kernel does not support swap limit capabilities, memory limited 
> without swap.
> twitter.common.app debug: Initializing: twitter.common.log (Logging 
> subsystem.)
> Writing log files to disk in /mnt/mesos/sandbox
> I1005 22:58:15.677225     7 exec.cpp:162] Version: 1.1.0
> I1005 22:58:15.680867    14 exec.cpp:237] Executor registered on agent 
> b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540
> Writing log files to disk in /mnt/mesos/sandbox
> I1006 01:13:52.950552    39 exec.cpp:487] Agent exited, but framework has 
> checkpointing enabled. Waiting 365days to reconnect with agent 
> b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540
>
>
>
> --- thermos (Aurora) ----
>
> I1023 19:03:05.765677 52364 fetcher.cpp:582] Fetched '/usr/bin/xxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-d3c1c4d9-4d74-433a-b26a-8a88bb7687b8/runs/982e7236-fccd-40bc-a2a5-d8a1901cf0bf/fxxx'
> WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
> twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
> Writing log files to disk in /mnt/mesos/sandbox
> I1023 19:04:32.261165     7 exec.cpp:162] Version: 1.2.0
> I1023 19:04:32.264870    42 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295
> Writing log files to disk in /mnt/mesos/sandbox
> Traceback (most recent call last):
>   File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>     self.__real_run(*args, **kw)
>   File "apache/thermos/monitoring/resource.py", line 243, in run
>   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/event_muxer.py", line 79, in wait
>     thread.start()
>   File "/usr/lib/python2.7/threading.py", line 745, in start
>     _start_new_thread(self.__bootstrap, ())
> thread.error: can't start new thread
> ERROR] Failed to stop health checkers:
> ERROR] Traceback (most recent call last):
>   File "apache/aurora/executor/aurora_executor.py", line 209, in _shutdown
>     propagate_deadline(self._chained_checker.stop, timeout=self.STOP_TIMEOUT)
>   File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
>     return deadline(*args, daemon=True, propagate=True, **kw)
>   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
>     AnonymousThread().start()
>   File "/usr/lib/python2.7/threading.py", line 745, in start
>     _start_new_thread(self.__bootstrap, ())
> error: can't start new thread
>
> ERROR] Failed to stop runner:
> ERROR] Traceback (most recent call last):
>   File "apache/aurora/executor/aurora_executor.py", line 217, in _shutdown
>     propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
>   File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
>     return deadline(*args, daemon=True, propagate=True, **kw)
>   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
>     AnonymousThread().start()
>   File "/usr/lib/python2.7/threading.py", line 745, in start
>     _start_new_thread(self.__bootstrap, ())
> error: can't start new thread
>
> Traceback (most recent call last):
>   File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>     self.__real_run(*args, **kw)
>   File "apache/aurora/executor/status_manager.py", line 62, in run
>   File "apache/aurora/executor/aurora_executor.py", line 235, in _shutdown
>   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deferred.py", line 56, in defer
>     deferred.start()
>   File "/usr/lib/python2.7/threading.py", line 745, in start
>     _start_new_thread(self.__bootstrap, ())
> thread.error: can't start new thread
>
>
>
>
> On Fri, Oct 27, 2017 at 2:25 PM, Vinod Kone <[email protected]> wrote:
>
>> Can you share the agent and executor logs of an example orphaned
>> executor? That would help us diagnose the issue.
>>
>> On Fri, Oct 27, 2017 at 8:19 PM, Mohit Jaggi <[email protected]>
>> wrote:
>>
>>> Folks,
>>> Often I see some orphaned executors in my cluster. These are cases where
>>> the framework was informed of task loss, so has forgotten about them as
>>> expected, but the container(docker) is still around. AFAIK, Mesos agent is
>>> the only entity that has knowledge of these containers. How do I ensure
>>> that they get cleaned up by the agent?
>>>
>>> Mohit.
>>>
>>
>>
>
