Tsuyoshi OZAWA commented on YARN-2313:

Hi Karthik, thank you for pointing it out.

 So, irrespective of how long update() takes the next Thread.sleep is called 
for 500 ms, no?

You're correct, the "busy loop" description is wrong. However, a starvation 
problem still remains:

1. {{FairScheduler#update()}} can hold the lock for more than 10 sec, the 
default value of reloadIntervalMs.
2. {{AllocationFileLoaderThread#onReload}} can hold the lock for more than 
500 ms, the default value of updateInterval.
3. As a result, {{FairScheduler#update()}} and {{FairScheduler#onReload}} can 
always win the lock on the {{FairScheduler}} instance.
4. {{ResourceManager$SchedulerEventDispatcher}} can wait forever.
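
To make the condition in items 1 and 2 concrete, here is a minimal sketch in 
Java. It is a simplified model and not Hadoop code: the class and method names 
are hypothetical, and it assumes each periodic thread sleeps for its full 
interval after releasing the lock. Under that assumption, when each thread 
holds the lock at least as long as the other one sleeps, one of them is 
already waiting every time the lock is released, so the dispatcher always has 
to compete with it.

```java
// Simplified model of the lock contention described above.
// Not Hadoop code: all names in this class are hypothetical.
final class LockContentionModel {
    private LockContentionModel() {}

    /**
     * Returns true when each periodic task holds the lock at least as long
     * as the other task sleeps, so that one of them is already waiting
     * whenever the lock is released.
     */
    static boolean alwaysContended(long updateHoldMs, long updateIntervalMs,
                                   long reloadHoldMs, long reloadIntervalMs) {
        // The reloader has finished sleeping before update() releases the lock.
        boolean reloaderWaitingWhenUpdateEnds = updateHoldMs >= reloadIntervalMs;
        // The updater has finished sleeping before onReload releases the lock.
        boolean updaterWaitingWhenReloadEnds = reloadHoldMs >= updateIntervalMs;
        return reloaderWaitingWhenUpdateEnds && updaterWaitingWhenReloadEnds;
    }

    public static void main(String[] args) {
        // Defaults from the comment: updateInterval = 500 ms, reloadIntervalMs = 10 s.
        // The hold times (11 s, 600 ms, ...) are made-up illustrative values.
        System.out.println(alwaysContended(11_000, 500, 600, 10_000));
        // Same hold times, but with a larger updateInterval (the workaround).
        System.out.println(alwaysContended(11_000, 2_000, 600, 10_000));
    }
}
```

In this model, raising updateInterval above the time onReload holds the lock 
breaks the condition, which leaves a window in which no periodic thread is 
waiting for the lock.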

The problem we faced was that the cluster (note that it is a very busy 
cluster!) hung even after we killed the existing apps. I captured the stack 
trace when it happened. In our case, we could avoid the problem by setting the 
configuration value (updateInterval) larger. IIUC, this is because it gives 
ResourceManager$SchedulerEventDispatcher a margin in which to acquire the lock.

As you mentioned, this fix is just a workaround. However, it is effective. A 
more fundamental fix would be to make updateInterval and reloadIntervalMs 
dynamic. Please correct me if I'm wrong.
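
One possible reading of "dynamic" is sketched below in Java. This is only an 
illustration under my own assumptions, not what the attached patches do: the 
class and method names are hypothetical, and the policy (sleep at least as 
long as the last update held the lock) is just one simple choice.

```java
// Hypothetical sketch of a dynamic update interval. Not Hadoop code.
final class AdaptiveUpdateInterval {
    private AdaptiveUpdateInterval() {}

    /**
     * Returns how long the update thread should sleep before its next run.
     * The sleep is never shorter than the configured base interval, and it
     * grows to match the time the last update spent holding the lock, so the
     * thread stays off the lock for at least as long as it held it.
     */
    static long nextSleepMs(long baseIntervalMs, long lastUpdateDurationMs) {
        if (baseIntervalMs < 0 || lastUpdateDurationMs < 0) {
            throw new IllegalArgumentException("durations must be non-negative");
        }
        return Math.max(baseIntervalMs, lastUpdateDurationMs);
    }

    public static void main(String[] args) {
        // Fast update: keep the configured 500 ms interval.
        System.out.println(nextSleepMs(500, 100));
        // Slow update (12 s): back off so other threads can take the lock.
        System.out.println(nextSleepMs(500, 12_000));
    }
}
```

With a policy like this, a slow update() no longer leads to the update thread 
asking for the lock again after only 500 ms.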

> Livelock can occur in FairScheduler when there are lots of running apps
> -----------------------------------------------------------------------
>                 Key: YARN-2313
>                 URL: https://issues.apache.org/jira/browse/YARN-2313
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.4.1
>            Reporter: Tsuyoshi OZAWA
>            Assignee: Tsuyoshi OZAWA
>             Fix For: 2.6.0
>         Attachments: YARN-2313.1.patch, YARN-2313.2.patch, YARN-2313.3.patch, 
> YARN-2313.4.patch, rm-stack-trace.txt
> Observed livelock on FairScheduler when there are lots of entries in the 
> queue. After investigating the code, I found that the following case can occur:
> 1. {{update()}} called by UpdateThread takes longer than 
> UPDATE_INTERVAL (500 ms) if there are lots of queues.
> 2. UpdateThread goes into a busy loop.
> 3. Other threads (AllocationFileReloader, 
> ResourceManager$SchedulerEventDispatcher) can wait forever.
