:-) Is there a Jira ticket to track this? Any idea when this will be worked on?
On Tue, Oct 31, 2017 at 5:22 PM, Benjamin Mahler <[email protected]> wrote:

> The question was posed merely to point out that there is no notion of the
> executor "running away" currently, due to the answer I provided: there
> isn't a complete lifecycle API for the executor. (This includes
> healthiness, state updates, reconciliation, the ability for the scheduler
> to shut it down, etc.)
>
> On Tue, Oct 31, 2017 at 4:27 PM, Mohit Jaggi <[email protected]> wrote:
>
>> Good question.
>> - I don't know what the interaction between the Mesos agent and the
>>   executor is. Is there a health check?
>> - There is reconciliation between Mesos and frameworks: will Mesos
>>   include the "orphan" executor in that list, so the framework can find
>>   runaways and kill them (using a Mesos-provided API)?
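For concreteness, the reconciliation Mohit asks about above can be driven
explicitly from the scheduler side. A minimal sketch against the v1
scheduler HTTP API, assuming Mesos 1.x, the `requests` library, and an
already-subscribed framework; the master URL, stream ID, and task/agent IDs
are placeholders:

    import requests

    SCHEDULER_API = "http://master.example.com:5050/api/v1/scheduler"  # placeholder
    FRAMEWORK_ID = "20160112-010512-421372426-5050-73504-0000"         # from the logs below

    # RECONCILE asks the master to re-send the latest status of each task.
    # An empty "tasks" list requests implicit reconciliation of all tasks.
    call = {
        "framework_id": {"value": FRAMEWORK_ID},
        "type": "RECONCILE",
        "reconcile": {
            "tasks": [{
                "task_id": {"value": "<task-id>"},    # placeholder
                "agent_id": {"value": "<agent-id>"},  # placeholder
            }],
        },
    }

    # The answers arrive as status updates on the framework's existing
    # SUBSCRIBE event stream, not in this HTTP response.
    resp = requests.post(
        SCHEDULER_API,
        json=call,
        headers={"Mesos-Stream-Id": "<stream-id-from-subscribe>"},  # placeholder
    )
    resp.raise_for_status()  # the master replies 202 Accepted

Note that this reconciles tasks, not executors, which is exactly the gap Ben
describes: a Thermos executor whose task is already terminal is invisible to
this mechanism.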
>> On Tue, Oct 31, 2017 at 3:49 PM, Benjamin Mahler <[email protected]> wrote:
>>
>>> What defines a runaway executor?
>>>
>>> Mesos does not know that this particular executor should self-terminate
>>> within some reasonable time after its task terminates. In this case the
>>> framework (Aurora) knows this expected behavior of Thermos and can clean
>>> up ones that get stuck after the task terminates. However, we currently
>>> don't provide a great executor lifecycle API to enable schedulers to do
>>> this (it's long overdue).
>>>
>>> On Tue, Oct 31, 2017 at 2:47 PM, Mohit Jaggi <[email protected]> wrote:
>>>
>>>> I was asking if this can happen automatically.
>>>>
>>>> On Tue, Oct 31, 2017 at 2:41 PM, Benjamin Mahler <[email protected]> wrote:
>>>>
>>>>> You can kill it manually by SIGKILLing the executor process.
>>>>> Using the agent API, you can launch a nested container session and
>>>>> kill the executor. +jie,gilbert, is there a CLI command for 'exec'ing
>>>>> into the container?
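A minimal sketch of the agent-API route Ben describes, assuming a Mesos 1.2+
agent with the v1 operator API and the Mesos containerizer; the agent URL
and parent container ID are placeholders, and matching the executor by the
name `thermos_executor` is an assumption about the deployment:

    import json
    import uuid

    import requests

    AGENT_API = "http://agent.example.com:5051/api/v1"          # placeholder
    EXECUTOR_CONTAINER = "<id from the agent's /containers>"    # placeholder

    # LAUNCH_NESTED_CONTAINER_SESSION runs a command inside the executor's
    # container (the "exec into the container" idea above) and streams its
    # I/O back. Here the command simply SIGKILLs the stuck executor.
    call = {
        "type": "LAUNCH_NESTED_CONTAINER_SESSION",
        "launch_nested_container_session": {
            "container_id": {
                "parent": {"value": EXECUTOR_CONTAINER},
                "value": str(uuid.uuid4()),
            },
            "command": {"shell": True, "value": "pkill -9 -f thermos_executor"},
        },
    }

    resp = requests.post(
        AGENT_API,
        data=json.dumps(call),
        headers={
            "Content-Type": "application/json",
            # The response is a RecordIO stream of the session's output;
            # the exact headers vary a little across Mesos versions.
            "Accept": "application/recordio",
            "Message-Accept": "application/json",
        },
        stream=True,
    )
    resp.raise_for_status()

This path only applies to the Mesos (universal) containerizer; for
Docker-containerizer executors like the ones in the original question, the
manual route is SIGKILL on the executor process or `docker rm -f` on the
container.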
>>>>> On Tue, Oct 31, 2017 at 12:47 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>
>>>>>> Yes. There is a fix available now in Aurora/Thermos to try and exit
>>>>>> in such scenarios. But I am curious to know if the Mesos agent has
>>>>>> the functionality to reap runaway executors.
>>>>>>
>>>>>> On Tue, Oct 31, 2017 at 12:08 PM, Benjamin Mahler <[email protected]> wrote:
>>>>>>
>>>>>>> Is my understanding correct that Thermos transitions the task to
>>>>>>> TASK_FAILED, but Thermos gets stuck and can't terminate itself? The
>>>>>>> typical workflow for Thermos, as a 1:1 task:executor approach, is
>>>>>>> that the executor terminates itself after the task is terminal.
>>>>>>>
>>>>>>> The full logs of the agent during this window would help; it looks
>>>>>>> like an agent termination is involved here as well?
>>>>>>>
>>>>>>> On Fri, Oct 27, 2017 at 3:09 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>
>>>>>>>> Here are some relevant logs.
>>>>>>>>
>>>>>>>> The Aurora scheduler log shows the task going from:
>>>>>>>> INIT
>>>>>>>> -> PENDING
>>>>>>>> -> ASSIGNED
>>>>>>>> -> STARTING
>>>>>>>> -> RUNNING for a long time
>>>>>>>> -> FAILED due to health check error, OSError: Resource temporarily unavailable (I think this is referring to running out of PID space, see thermos logs below)
>>>>>>>>
>>>>>>>> --- mesos agent ---
>>>>>>>>
>>>>>>>> I1005 22:56:47.902153 127818 fetcher.cpp:285] Fetching directly into the sandbox directory
>>>>>>>> I1005 22:56:47.902170 127818 fetcher.cpp:222] Fetching URI '/usr/bin/XXXXX'
>>>>>>>> I1005 22:56:47.913270 127818 fetcher.cpp:207] Copied resource '/usr/bin/xxxxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
>>>>>>>> I1005 22:56:47.913331 127818 fetcher.cpp:582] Fetched '/usr/bin/xxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
>>>>>>>> WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
>>>>>>>> twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
>>>>>>>> Writing log files to disk in /mnt/mesos/sandbox
>>>>>>>> I1005 22:58:15.677225 7 exec.cpp:162] Version: 1.1.0
>>>>>>>> I1005 22:58:15.680867 14 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540
>>>>>>>> Writing log files to disk in /mnt/mesos/sandbox
>>>>>>>> I1006 01:13:52.950552 39 exec.cpp:487] Agent exited, but framework has checkpointing enabled. Waiting 365days to reconnect with agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540
>>>>>>>>
>>>>>>>> --- thermos (Aurora) ---
>>>>>>>>
>>>>>>>> 1 I1023 19:03:05.765677 52364 fetcher.cpp:582] Fetched '/usr/bin/xxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-d3c1c4d9-4d74-433a-b26a-8a88bb7687b8/runs/982e7236-fccd-40bc-a2a5-d8a1901cf0bf/fxxx'
>>>>>>>> 22 WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
>>>>>>>> 23 twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
>>>>>>>> 24 Writing log files to disk in /mnt/mesos/sandbox
>>>>>>>> 25 I1023 19:04:32.261165 7 exec.cpp:162] Version: 1.2.0
>>>>>>>> 26 I1023 19:04:32.264870 42 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295
>>>>>>>> 27 Writing log files to disk in /mnt/mesos/sandbox
>>>>>>>> 28 Traceback (most recent call last):
>>>>>>>> 29   File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>>>>>>>> 30     self.__real_run(*args, **kw)
>>>>>>>> 31   File "apache/thermos/monitoring/resource.py", line 243, in run
>>>>>>>> 32   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/event_muxer.py", line 79, in wait
>>>>>>>> 33     thread.start()
>>>>>>>> 34   File "/usr/lib/python2.7/threading.py", line 745, in start
>>>>>>>> 35     _start_new_thread(self.__bootstrap, ())
>>>>>>>> 36 thread.error: can't start new thread
>>>>>>>> 37 ERROR] Failed to stop health checkers:
>>>>>>>> 38 ERROR] Traceback (most recent call last):
>>>>>>>> 39   File "apache/aurora/executor/aurora_executor.py", line 209, in _shutdown
>>>>>>>> 40     propagate_deadline(self._chained_checker.stop, timeout=self.STOP_TIMEOUT)
>>>>>>>> 41   File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
>>>>>>>> 42     return deadline(*args, daemon=True, propagate=True, **kw)
>>>>>>>> 43   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
>>>>>>>> 44     AnonymousThread().start()
>>>>>>>> 45   File "/usr/lib/python2.7/threading.py", line 745, in start
>>>>>>>> 46     _start_new_thread(self.__bootstrap, ())
>>>>>>>> 47 error: can't start new thread
>>>>>>>> 48
>>>>>>>> 49 ERROR] Failed to stop runner:
>>>>>>>> 50 ERROR] Traceback (most recent call last):
>>>>>>>> 51   File "apache/aurora/executor/aurora_executor.py", line 217, in _shutdown
>>>>>>>> 52     propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
>>>>>>>> 53   File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
>>>>>>>> 54     return deadline(*args, daemon=True, propagate=True, **kw)
>>>>>>>> 55   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
>>>>>>>> 56     AnonymousThread().start()
>>>>>>>> 57   File "/usr/lib/python2.7/threading.py", line 745, in start
>>>>>>>> 58     _start_new_thread(self.__bootstrap, ())
>>>>>>>> 59 error: can't start new thread
>>>>>>>> 60
>>>>>>>> 61 Traceback (most recent call last):
>>>>>>>> 62   File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>>>>>>>> 63     self.__real_run(*args, **kw)
>>>>>>>> 64   File "apache/aurora/executor/status_manager.py", line 62, in run
>>>>>>>> 65   File "apache/aurora/executor/aurora_executor.py", line 235, in _shutdown
>>>>>>>> 66   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deferred.py", line 56, in defer
>>>>>>>> 67     deferred.start()
>>>>>>>> 68   File "/usr/lib/python2.7/threading.py", line 745, in start
>>>>>>>> 69     _start_new_thread(self.__bootstrap, ())
>>>>>>>> 70 thread.error: can't start new thread
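The repeated "can't start new thread" failures above are thread creation
failing with EAGAIN, which is what "Resource temporarily unavailable" means:
some process/thread limit is exhausted, so the executor cannot even run its
own shutdown path. A small diagnostic sketch for the usual suspects,
assuming a Linux host; the cgroup path at the end is hypothetical (the real
one is listed in /proc/<executor-pid>/cgroup):

    from __future__ import print_function

    import resource

    # Global limits that bound thread/process creation system-wide.
    with open("/proc/sys/kernel/pid_max") as f:
        print("kernel.pid_max     =", f.read().strip())
    with open("/proc/sys/kernel/threads-max") as f:
        print("kernel.threads-max =", f.read().strip())

    # Per-user process limit; threads count against it too.
    print("RLIMIT_NPROC       =", resource.getrlimit(resource.RLIMIT_NPROC))

    # If the executor runs under a cgroup-v1 pids controller, its own cap
    # applies as well.
    CGROUP = "/sys/fs/cgroup/pids/mesos/<container-id>"  # hypothetical path
    for name in ("pids.current", "pids.max"):
        try:
            with open(CGROUP + "/" + name) as f:
                print(name, "=", f.read().strip())
        except IOError:
            pass  # no pids controller mounted, or wrong path

Exhausting any one of these produces exactly the EAGAIN seen in the thermos
log.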
>>>>>>>> On Fri, Oct 27, 2017 at 2:25 PM, Vinod Kone <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Can you share the agent and executor logs of an example orphaned
>>>>>>>>> executor? That would help us diagnose the issue.
>>>>>>>>>
>>>>>>>>> On Fri, Oct 27, 2017 at 8:19 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Folks,
>>>>>>>>>> Often I see some orphaned executors in my cluster. These are cases
>>>>>>>>>> where the framework was informed of task loss, so it has forgotten
>>>>>>>>>> about them as expected, but the (Docker) container is still around.
>>>>>>>>>> AFAIK, the Mesos agent is the only entity that has knowledge of
>>>>>>>>>> these containers. How do I ensure that they get cleaned up by the
>>>>>>>>>> agent?
>>>>>>>>>>
>>>>>>>>>> Mohit.
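On the original question: as I understand it, the agent's Docker
containerizer does reap containers it no longer knows about, but only during
recovery, via the --docker_kill_orphans flag (default true), so containers
orphaned while the agent keeps running can linger. A cron-able sketch of
doing the same comparison by hand, assuming the Docker containerizer's
"mesos-" container-name prefix and the agent's /containers endpoint (both
worth verifying against your Mesos version before deleting anything):

    import subprocess

    import requests

    AGENT = "http://localhost:5051"  # run on the agent host; placeholder port

    # Container IDs the agent still tracks.
    known = {c["container_id"] for c in requests.get(AGENT + "/containers").json()}

    # Docker containers launched by Mesos carry a "mesos-" name prefix that
    # embeds the container ID (naming details have varied across versions).
    names = subprocess.check_output(
        ["docker", "ps", "--format", "{{.Names}}"]).decode().split()

    for name in names:
        if not name.startswith("mesos-"):
            continue  # not managed by Mesos
        container_id = name[len("mesos-"):].split(".")[-1]
        if container_id not in known:
            # Orphan: the agent no longer tracks it, so reap it ourselves.
            print("reaping orphan %s" % name)
            subprocess.check_call(["docker", "rm", "-f", name])

Treat this as a last resort; the Aurora/Thermos-side fix Mohit mentions is
the cleaner answer.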

