Yes. There is a fix available now in Aurora/Thermos that tries to exit in such scenarios. But I am curious to know whether the Mesos agent has the functionality to reap runaway executors.
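For context, below is a minimal sketch of the kind of "force exit" fallback such a fix could use. This is not the actual Aurora/Thermos patch; the ExecutorShutdown class, the GRACE_PERIOD_SECS value, and the graceful_stop callable are hypothetical. The point is that once the process has exhausted its thread/PID budget (the "can't start new thread" errors in the logs below), graceful shutdown itself can fail, so the executor arms a signal-based deadline and falls back to os._exit() rather than relying on spawning new threads:

    import os
    import signal
    import threading


    class ExecutorShutdown(object):
        # Hypothetical grace period before the hard exit fires.
        GRACE_PERIOD_SECS = 30

        def __init__(self, graceful_stop):
            # graceful_stop: callable that stops health checkers, runner, etc.
            self._graceful_stop = graceful_stop

        def shutdown(self):
            # Arm a hard deadline first. A signal-based alarm is used instead of
            # a watchdog thread because spawning a thread is exactly what may
            # fail once the process runs out of PIDs/threads.
            signal.signal(signal.SIGALRM, lambda signum, frame: os._exit(1))
            signal.alarm(self.GRACE_PERIOD_SECS)
            try:
                self._graceful_stop()
            except (threading.ThreadError, RuntimeError, OSError):
                # Graceful teardown could not complete; exit anyway so the agent
                # can reap the container instead of leaving it orphaned.
                os._exit(1)
            os._exit(0)

Using signal.alarm() rather than a watchdog thread matters here, because starting a thread is the very operation that fails once the PID space is exhausted.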
On Tue, Oct 31, 2017 at 12:08 PM, Benjamin Mahler <[email protected]> wrote:

> Is my understanding correct that Thermos transitions the task to
> TASK_FAILED, but Thermos gets stuck and can't terminate itself? The typical
> workflow for thermos, as a 1:1 task:executor approach, is that the executor
> terminates itself after the task is terminal.
>
> The full logs of the agent during this window would help; it looks like an
> agent termination is involved here as well?
>
> On Fri, Oct 27, 2017 at 3:09 PM, Mohit Jaggi <[email protected]> wrote:
>
>> Here are some relevant logs. The Aurora scheduler log shows the task going
>> from:
>> INIT
>> -> PENDING
>> -> ASSIGNED
>> -> STARTING
>> -> RUNNING for a long time
>> -> FAILED due to a health check error, OSError: Resource temporarily
>> unavailable (I think this is referring to running out of PID space; see
>> the thermos logs below)
>>
>> --- mesos agent ---
>>
>> I1005 22:56:47.902153 127818 fetcher.cpp:285] Fetching directly into the sandbox directory
>> I1005 22:56:47.902170 127818 fetcher.cpp:222] Fetching URI '/usr/bin/XXXXX'
>> I1005 22:56:47.913270 127818 fetcher.cpp:207] Copied resource '/usr/bin/xxxxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
>> I1005 22:56:47.913331 127818 fetcher.cpp:582] Fetched '/usr/bin/xxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
>> WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
>> twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
>> Writing log files to disk in /mnt/mesos/sandbox
>> I1005 22:58:15.677225 7 exec.cpp:162] Version: 1.1.0
>> I1005 22:58:15.680867 14 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540
>> Writing log files to disk in /mnt/mesos/sandbox
>> I1006 01:13:52.950552 39 exec.cpp:487] Agent exited, but framework has checkpointing enabled. Waiting 365days to reconnect with agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540
>>
>> --- thermos (Aurora) ----
>>
>> I1023 19:03:05.765677 52364 fetcher.cpp:582] Fetched '/usr/bin/xxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-d3c1c4d9-4d74-433a-b26a-8a88bb7687b8/runs/982e7236-fccd-40bc-a2a5-d8a1901cf0bf/fxxx'
>> WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
>> twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
>> Writing log files to disk in /mnt/mesos/sandbox
>> I1023 19:04:32.261165 7 exec.cpp:162] Version: 1.2.0
>> I1023 19:04:32.264870 42 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295
>> Writing log files to disk in /mnt/mesos/sandbox
>> Traceback (most recent call last):
>>   File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>>     self.__real_run(*args, **kw)
>>   File "apache/thermos/monitoring/resource.py", line 243, in run
>>   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/event_muxer.py", line 79, in wait
>>     thread.start()
>>   File "/usr/lib/python2.7/threading.py", line 745, in start
>>     _start_new_thread(self.__bootstrap, ())
>> thread.error: can't start new thread
>> ERROR] Failed to stop health checkers:
>> ERROR] Traceback (most recent call last):
>>   File "apache/aurora/executor/aurora_executor.py", line 209, in _shutdown
>>     propagate_deadline(self._chained_checker.stop, timeout=self.STOP_TIMEOUT)
>>   File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
>>     return deadline(*args, daemon=True, propagate=True, **kw)
>>   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
>>     AnonymousThread().start()
>>   File "/usr/lib/python2.7/threading.py", line 745, in start
>>     _start_new_thread(self.__bootstrap, ())
>> error: can't start new thread
>>
>> ERROR] Failed to stop runner:
>> ERROR] Traceback (most recent call last):
>>   File "apache/aurora/executor/aurora_executor.py", line 217, in _shutdown
>>     propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
>>   File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
>>     return deadline(*args, daemon=True, propagate=True, **kw)
>>   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
>>     AnonymousThread().start()
>>   File "/usr/lib/python2.7/threading.py", line 745, in start
>>     _start_new_thread(self.__bootstrap, ())
>> error: can't start new thread
>>
>> Traceback (most recent call last):
>>   File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>>     self.__real_run(*args, **kw)
>>   File "apache/aurora/executor/status_manager.py", line 62, in run
>>   File "apache/aurora/executor/aurora_executor.py", line 235, in _shutdown
>>   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deferred.py", line 56, in defer
>>     deferred.start()
>>   File "/usr/lib/python2.7/threading.py", line 745, in start
>>     _start_new_thread(self.__bootstrap, ())
>> thread.error: can't start new thread
>>
>> On Fri, Oct 27, 2017 at 2:25 PM, Vinod Kone <[email protected]> wrote:
>>
>>> Can you share the agent and executor logs of an example orphaned
>>> executor? That would help us diagnose the issue.
>>>
>>> On Fri, Oct 27, 2017 at 8:19 PM, Mohit Jaggi <[email protected]> wrote:
>>>
>>>> Folks,
>>>> Often I see some orphaned executors in my cluster. These are cases
>>>> where the framework was informed of task loss, so it has forgotten about
>>>> them as expected, but the container (docker) is still around. AFAIK, the
>>>> Mesos agent is the only entity that has knowledge of these containers.
>>>> How do I ensure that they get cleaned up by the agent?
>>>>
>>>> Mohit.
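Until the agent reaps these itself, one stopgap is an operator-side sweep that compares the Docker containers whose names carry the Docker containerizer's "mesos-" prefix against the containers the local agent still reports, and flags the ones the agent no longer knows about. The sketch below is not a Mesos feature; the agent address (localhost:5051), the shape of the /containers response, and the name prefix are assumptions about a typical setup and should be verified against your deployment before removing anything:

    # Hedged sketch: dry-run sweep for Docker containers that look orphaned.
    import json
    import subprocess
    import urllib2  # Python 2, matching the environment in the tracebacks above

    AGENT_CONTAINERS_URL = 'http://localhost:5051/containers'  # assumed local agent


    def agent_container_ids():
        # Container IDs the local agent still tracks (assumes each entry in the
        # /containers response carries a 'container_id' field).
        entries = json.load(urllib2.urlopen(AGENT_CONTAINERS_URL))
        return set(entry['container_id'] for entry in entries)


    def docker_mesos_containers():
        # Names of running Docker containers that look containerizer-launched.
        out = subprocess.check_output(
            ['docker', 'ps', '--filter', 'name=mesos-', '--format', '{{.Names}}'])
        return [name for name in out.splitlines() if name]


    def find_orphans():
        known = agent_container_ids()
        # Anything whose name does not embed a container ID the agent knows
        # about is a candidate orphan.
        return [name for name in docker_mesos_containers()
                if not any(cid in name for cid in known)]


    if __name__ == '__main__':
        for name in find_orphans():
            print 'candidate orphan:', name
            # subprocess.check_call(['docker', 'rm', '-f', name])  # uncomment to remove

A safer variant would also require a container to have been running longer than some threshold before removing it, to avoid racing with containers the agent is still in the middle of launching.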

