[ 
https://issues.apache.org/jira/browse/YARN-7052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-7052:
-----------------------------
    Description: 
In YARN-7051, we ran into a case where the preemption monitor thread hung with 
no indication of why.

The preemption monitor is scheduled on a {{ScheduledExecutorService}} from 
{{SchedulingMonitor#serviceStart}}. Once an uncaught throwable escapes the 
monitor, nothing ever gets the result of the future, so the failure is never 
reported. The thread running the preemption monitor never dies, and the 
monitor never gets rescheduled.

If {{HadoopExecutors}} were used, it would at least provide a 
{{HadoopScheduledThreadPoolExecutor}} that logs the exception if one happens.

  was:
In YARN-7051, we ran into a case where the preemption monitor thread hung with 
no indication of why. This was because the preemption monitor is started by the 
{{SchedulingExecutorService}} from {{SchedulingMonitor#serviceStart}}, and then 
nothing ever gets the result of the future or allows it to throw an exception 
if needed.

At least with {{HadoopExecutor}}, it will provide a 
{{HadoopScheduledThreadPoolExecutor}} that logs the exception if one happens.

        Summary: RM SchedulingMonitor gives no indication why the spawned 
thread crashed.  (was: RM SchedulingMonitor should use HadoopExecutors when 
creating ScheduledExecutorService)

Using {{HadoopExecutors}} may not be feasible.

The following is used to launch the preemption thread:
{code:title=SchedulingMonitor#serviceStart}
    ses = Executors.newSingleThreadScheduledExecutor(new ThreadFactory() {
...
    handler = ses.scheduleAtFixedRate(new PreemptionChecker(),
        0, monitorInterval, TimeUnit.MILLISECONDS);
{code}

{{HadoopExecutors}} provides a {{newSingleThreadScheduledExecutor}} method, 
but it simply delegates to {{Executors#newSingleThreadScheduledExecutor}}. Its 
return value is not wrapped in a {{HadoopScheduledThreadPoolExecutor}}, so 
calling {{HadoopExecutors#newSingleThreadScheduledExecutor}} does not provide 
the logging benefit.
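
For illustration, the kind of logging wrapper discussed here can be sketched with plain {{java.util.concurrent}} classes. This is only a sketch of the general technique of overriding {{afterExecute}}; it is not Hadoop's implementation, and {{LoggingScheduledExecutor}}, {{LoggingExecutorDemo}}, and the {{failureLog}} callback are hypothetical names standing in for a real logger.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical executor that reports failures of scheduled tasks.
class LoggingScheduledExecutor extends ScheduledThreadPoolExecutor {
  private final Consumer<Throwable> failureLog; // stand-in for a real logger

  LoggingScheduledExecutor(int corePoolSize, Consumer<Throwable> failureLog) {
    super(corePoolSize);
    this.failureLog = failureLog;
  }

  @Override
  protected void afterExecute(Runnable r, Throwable t) {
    super.afterExecute(r, t);
    // Scheduled tasks are wrapped in a Future, so t is normally null and the
    // failure has to be pulled out of the completed Future instead.
    if (t == null && r instanceof Future<?> && ((Future<?>) r).isDone()) {
      try {
        ((Future<?>) r).get();
      } catch (ExecutionException e) {
        t = e.getCause();
      } catch (CancellationException e) {
        // A cancelled task is not a failure; nothing to report.
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    if (t != null) {
      failureLog.accept(t);
    }
  }
}

class LoggingExecutorDemo {
  // Schedules a task that always fails and returns the throwable that the
  // executor reported through the callback.
  static Throwable firstLoggedFailure() {
    CompletableFuture<Throwable> logged = new CompletableFuture<>();
    LoggingScheduledExecutor ses =
        new LoggingScheduledExecutor(1, logged::complete);
    try {
      ses.scheduleAtFixedRate(() -> {
        throw new IllegalStateException("simulated failure");
      }, 0, 10, TimeUnit.MILLISECONDS);
      return logged.get(5, TimeUnit.SECONDS);
    } catch (Exception e) {
      throw new IllegalStateException(e);
    } finally {
      ses.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("Reported: " + firstLoggedFailure());
  }
}
```

With a wrapper like this, the failure is at least reported, although the periodic task still stops running after it throws.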

Alternatively, we could have the thread itself catch and handle throwables.

The thread launched by {{SchedulingMonitor#serviceStart}} calls 
{{PreemptionChecker#run}}, which only handles {{YarnRuntimeException}}. 
Any other throwable escapes {{run}}, and {{scheduleAtFixedRate}} then 
suppresses all subsequent executions, so the monitor silently stops running.
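
The failure mode can be reproduced outside of YARN with a small standalone sketch. It assumes nothing about the RM code; {{SilentFailureDemo}} and {{runsBeforeSuppression}} are hypothetical names, and the task simply throws on its second run.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

class SilentFailureDemo {
  // Schedules a periodic task that throws on its second run and returns how
  // many times the task ran in total.
  static int runsBeforeSuppression() {
    AtomicInteger runs = new AtomicInteger();
    ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
    try {
      ScheduledFuture<?> handler = ses.scheduleAtFixedRate(() -> {
        if (runs.incrementAndGet() == 2) {
          throw new IllegalStateException("simulated failure");
        }
      }, 0, 10, TimeUnit.MILLISECONDS);
      try {
        // Asking the future for its result is the only way the failure
        // becomes visible; without this call nothing is ever reported.
        handler.get(5, TimeUnit.SECONDS);
      } catch (ExecutionException e) {
        System.out.println("Periodic task failed: " + e.getCause());
      }
      // Leave time for further runs; none happen once the task has thrown.
      Thread.sleep(100);
      return runs.get();
    } catch (InterruptedException | TimeoutException e) {
      throw new IllegalStateException(e);
    } finally {
      ses.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("Runs before suppression: " + runsBeforeSuppression());
  }
}
```

The method returns 2: the task is never run again after the exception, and the executor thread stays alive but idle until the executor is shut down.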

Another solution would be to catch other throwables as well, log them, 
and either re-throw them or cancel the scheduled task.
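
A minimal sketch of that approach, written as a generic wrapper rather than as a change to {{PreemptionChecker}} itself. {{LoggingRunnable}}, {{LoggingRunnableDemo}}, and the {{log}} callback are hypothetical names; a real patch would use the RM's own logger and might cancel the task instead of re-throwing.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical wrapper: logs anything the delegate throws, then re-throws it.
class LoggingRunnable implements Runnable {
  private final Runnable delegate;
  private final Consumer<Throwable> log; // stand-in for a real logger

  LoggingRunnable(Runnable delegate, Consumer<Throwable> log) {
    this.delegate = delegate;
    this.log = log;
  }

  @Override
  public void run() {
    try {
      delegate.run();
    } catch (Throwable t) {
      log.accept(t);
      // Re-throwing keeps the executor's behavior unchanged (later runs are
      // still suppressed), but the cause has now been recorded.
      throw t;
    }
  }
}

class LoggingRunnableDemo {
  // Runs a failing task through the wrapper and returns what was logged.
  static List<Throwable> failuresLogged() {
    List<Throwable> logged = new CopyOnWriteArrayList<>();
    ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
    try {
      ScheduledFuture<?> handler = ses.scheduleAtFixedRate(
          new LoggingRunnable(() -> {
            throw new IllegalArgumentException("simulated failure");
          }, logged::add),
          0, 10, TimeUnit.MILLISECONDS);
      try {
        handler.get(5, TimeUnit.SECONDS);
      } catch (ExecutionException e) {
        // Expected: the wrapper re-threw after logging.
      }
      return logged;
    } catch (Exception e) {
      throw new IllegalStateException(e);
    } finally {
      ses.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("Logged: " + failuresLogged());
  }
}
```

Swallowing the throwable instead of re-throwing would keep the periodic task running; whether that is desirable depends on whether the monitor can safely continue after an unexpected error.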

> RM SchedulingMonitor gives no indication why the spawned thread crashed.
> ------------------------------------------------------------------------
>
>                 Key: YARN-7052
>                 URL: https://issues.apache.org/jira/browse/YARN-7052
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Eric Payne
>            Assignee: Eric Payne


