I filed one: https://issues.apache.org/jira/browse/MESOS-8167
It's a pretty significant effort, and hasn't been requested a lot, so it's unlikely to be worked on for some time.

On Tue, Oct 31, 2017 at 8:18 PM, Mohit Jaggi <[email protected]> wrote:

> :-)
> Is there a Jira ticket to track this? Any idea when this will be worked on?
>
> On Tue, Oct 31, 2017 at 5:22 PM, Benjamin Mahler <[email protected]> wrote:
>
>> The question was posed merely to point out that there is no notion of the
>> executor "running away" currently, due to the answer I provided: there
>> isn't a complete lifecycle API for the executor. (This includes
>> healthiness, state updates, reconciliation, the ability for the scheduler
>> to shut it down, etc.)
>>
>> On Tue, Oct 31, 2017 at 4:27 PM, Mohit Jaggi <[email protected]> wrote:
>>
>>> Good question.
>>> - I don't know what the interaction between the Mesos agent and the
>>> executor is. Is there a health check?
>>> - There is a reconciliation between Mesos and frameworks: will Mesos
>>> include the "orphan" executor in the list there, so the framework can
>>> find runaways and kill them (using a Mesos-provided API)?
>>>
>>> On Tue, Oct 31, 2017 at 3:49 PM, Benjamin Mahler <[email protected]> wrote:
>>>
>>>> What defines a runaway executor?
>>>>
>>>> Mesos does not know that this particular executor should self-terminate
>>>> within some reasonable time after its task terminates. In this case the
>>>> framework (Aurora) knows this expected behavior of Thermos and can clean
>>>> up ones that get stuck after the task terminates. However, we currently
>>>> don't provide a great executor lifecycle API to enable schedulers to do
>>>> this (it's long overdue).
>>>>
>>>> On Tue, Oct 31, 2017 at 2:47 PM, Mohit Jaggi <[email protected]> wrote:
>>>>
>>>>> I was asking if this can happen automatically.
>>>>>
>>>>> On Tue, Oct 31, 2017 at 2:41 PM, Benjamin Mahler <[email protected]> wrote:
>>>>>
>>>>>> You can kill it manually by SIGKILLing the executor process.
>>>>>> Using the agent API, you can launch a nested container session and
>>>>>> kill the executor. +jie,gilbert, is there a CLI command for 'exec'ing
>>>>>> into the container?
>>>>>>
>>>>>> On Tue, Oct 31, 2017 at 12:47 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>
>>>>>>> Yes. There is a fix available now in Aurora/Thermos to try and exit
>>>>>>> in such scenarios. But I am curious to know whether the Mesos agent
>>>>>>> has the functionality to reap runaway executors.
>>>>>>>
>>>>>>> On Tue, Oct 31, 2017 at 12:08 PM, Benjamin Mahler <[email protected]> wrote:
>>>>>>>
>>>>>>>> Is my understanding correct that Thermos transitions the task to
>>>>>>>> TASK_FAILED, but then gets stuck and can't terminate itself? The
>>>>>>>> typical workflow for Thermos, as a 1:1 task:executor approach, is
>>>>>>>> that the executor terminates itself after the task is terminal.
>>>>>>>>
>>>>>>>> The full logs of the agent during this window would help; it looks
>>>>>>>> like an agent termination is involved here as well?
>>>>>>>>
>>>>>>>> On Fri, Oct 27, 2017 at 3:09 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Here are some relevant logs.
>>>>>>>>> The Aurora scheduler logs show the task going from:
>>>>>>>>>
>>>>>>>>> INIT
>>>>>>>>> -> PENDING
>>>>>>>>> -> ASSIGNED
>>>>>>>>> -> STARTING
>>>>>>>>> -> RUNNING for a long time
>>>>>>>>> -> FAILED due to a health check error, OSError: Resource temporarily
>>>>>>>>>    unavailable (I think this is referring to running out of PID
>>>>>>>>>    space; see the Thermos logs below)
>>>>>>>>>
>>>>>>>>> --- mesos agent ---
>>>>>>>>>
>>>>>>>>> I1005 22:56:47.902153 127818 fetcher.cpp:285] Fetching directly into the sandbox directory
>>>>>>>>> I1005 22:56:47.902170 127818 fetcher.cpp:222] Fetching URI '/usr/bin/XXXXX'
>>>>>>>>> I1005 22:56:47.913270 127818 fetcher.cpp:207] Copied resource '/usr/bin/xxxxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
>>>>>>>>> I1005 22:56:47.913331 127818 fetcher.cpp:582] Fetched '/usr/bin/xxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
>>>>>>>>> WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
>>>>>>>>> twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
>>>>>>>>> Writing log files to disk in /mnt/mesos/sandbox
>>>>>>>>> I1005 22:58:15.677225 7 exec.cpp:162] Version: 1.1.0
>>>>>>>>> I1005 22:58:15.680867 14 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540
>>>>>>>>> Writing log files to disk in /mnt/mesos/sandbox
>>>>>>>>> I1006 01:13:52.950552 39 exec.cpp:487] Agent exited, but framework has checkpointing enabled. Waiting 365days to reconnect with agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540
>>>>>>>>>
>>>>>>>>> --- thermos (Aurora) ----
>>>>>>>>>
>>>>>>>>>  1 I1023 19:03:05.765677 52364 fetcher.cpp:582] Fetched '/usr/bin/xxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-d3c1c4d9-4d74-433a-b26a-8a88bb7687b8/runs/982e7236-fccd-40bc-a2a5-d8a1901cf0bf/fxxx'
>>>>>>>>> 22 WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
>>>>>>>>> 23 twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
>>>>>>>>> 24 Writing log files to disk in /mnt/mesos/sandbox
>>>>>>>>> 25 I1023 19:04:32.261165 7 exec.cpp:162] Version: 1.2.0
>>>>>>>>> 26 I1023 19:04:32.264870 42 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295
>>>>>>>>> 27 Writing log files to disk in /mnt/mesos/sandbox
>>>>>>>>> 28 Traceback (most recent call last):
>>>>>>>>> 29   File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>>>>>>>>> 30     self.__real_run(*args, **kw)
>>>>>>>>> 31   File "apache/thermos/monitoring/resource.py", line 243, in run
>>>>>>>>> 32   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/event_muxer.py", line 79, in wait
>>>>>>>>> 33     thread.start()
>>>>>>>>> 34   File "/usr/lib/python2.7/threading.py", line 745, in start
>>>>>>>>> 35     _start_new_thread(self.__bootstrap, ())
>>>>>>>>> 36 thread.error: can't start new thread
>>>>>>>>> 37 ERROR] Failed to stop health checkers:
>>>>>>>>> 38 ERROR] Traceback (most recent call last):
>>>>>>>>> 39   File "apache/aurora/executor/aurora_executor.py", line 209, in _shutdown
>>>>>>>>> 40     propagate_deadline(self._chained_checker.stop, timeout=self.STOP_TIMEOUT)
>>>>>>>>> 41   File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
>>>>>>>>> 42     return deadline(*args, daemon=True, propagate=True, **kw)
>>>>>>>>> 43   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
>>>>>>>>> 44     AnonymousThread().start()
>>>>>>>>> 45   File "/usr/lib/python2.7/threading.py", line 745, in start
>>>>>>>>> 46     _start_new_thread(self.__bootstrap, ())
>>>>>>>>> 47 error: can't start new thread
>>>>>>>>> 48
>>>>>>>>> 49 ERROR] Failed to stop runner:
>>>>>>>>> 50 ERROR] Traceback (most recent call last):
>>>>>>>>> 51   File "apache/aurora/executor/aurora_executor.py", line 217, in _shutdown
>>>>>>>>> 52     propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
>>>>>>>>> 53   File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
>>>>>>>>> 54     return deadline(*args, daemon=True, propagate=True, **kw)
>>>>>>>>> 55   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
>>>>>>>>> 56     AnonymousThread().start()
>>>>>>>>> 57   File "/usr/lib/python2.7/threading.py", line 745, in start
>>>>>>>>> 58     _start_new_thread(self.__bootstrap, ())
>>>>>>>>> 59 error: can't start new thread
>>>>>>>>> 60
>>>>>>>>> 61 Traceback (most recent call last):
>>>>>>>>> 62   File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>>>>>>>>> 63     self.__real_run(*args, **kw)
>>>>>>>>> 64   File "apache/aurora/executor/status_manager.py", line 62, in run
File "apache/aurora/executor/aurora_executor.py", line 235, in >>>>>>>>> _shutdown >>>>>>>>> 66 File >>>>>>>>> "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deferred.py", >>>>>>>>> line 5 6, in defer >>>>>>>>> 67 deferred.start() >>>>>>>>> 68 File "/usr/lib/python2.7/threading.py", line 745, in start >>>>>>>>> 69 _start_new_thread(self.__bootstrap, ()) >>>>>>>>> 70 thread.error: can't start new thread >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Oct 27, 2017 at 2:25 PM, Vinod Kone <[email protected]> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Can you share the agent and executor logs of an example orphaned >>>>>>>>>> executor? That would help us diagnose the issue. >>>>>>>>>> >>>>>>>>>> On Fri, Oct 27, 2017 at 8:19 PM, Mohit Jaggi < >>>>>>>>>> [email protected]> wrote: >>>>>>>>>> >>>>>>>>>>> Folks, >>>>>>>>>>> Often I see some orphaned executors in my cluster. These are >>>>>>>>>>> cases where the framework was informed of task loss, so has >>>>>>>>>>> forgotten about >>>>>>>>>>> them as expected, but the container(docker) is still around. AFAIK, >>>>>>>>>>> Mesos >>>>>>>>>>> agent is the only entity that has knowledge of these containers. >>>>>>>>>>> How do I >>>>>>>>>>> ensure that they get cleaned up by the agent? >>>>>>>>>>> >>>>>>>>>>> Mohit. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >

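[Editor's note, not part of the thread] Every traceback above ends in "can't start new thread", which Python raises when the underlying pthread_create() call fails, typically with EAGAIN because the per-user process/thread limit (RLIMIT_NPROC) or a pids cgroup limit is exhausted. That matches Mohit's "running out of PID space" reading and also explains why the executor wedges: its own shutdown path needs to spawn threads. A rough way to check on the agent host, assuming a cgroup v1 layout (the /sys/fs/cgroup/pids path and the script name are assumptions):

#!/usr/bin/env python
# Rough diagnostic sketch for "can't start new thread": given an executor
# PID, print its current thread count and the limits that commonly cause
# pthread_create() to fail with EAGAIN.
# Usage: python check_thread_limits.py <executor-pid>
import sys


def read(path):
    try:
        with open(path) as f:
            return f.read()
    except (IOError, OSError):
        return ''


if __name__ == '__main__':
    pid = sys.argv[1]

    # Current number of threads in the executor process.
    for line in read('/proc/%s/status' % pid).splitlines():
        if line.startswith('Threads:'):
            print(line)

    # Per-user process/thread limit (RLIMIT_NPROC) as seen by the executor.
    for line in read('/proc/%s/limits' % pid).splitlines():
        if line.startswith('Max processes'):
            print(line.strip())

    # pids cgroup limit, if the executor is in one (cgroup v1 layout assumed).
    for line in read('/proc/%s/cgroup' % pid).splitlines():
        if ':pids:' in line:
            cg = line.strip().split(':', 2)[2]
            print('pids.current: %s' % read('/sys/fs/cgroup/pids%s/pids.current' % cg).strip())
            print('pids.max: %s' % read('/sys/fs/cgroup/pids%s/pids.max' % cg).strip())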
