[ 
https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16425543#comment-16425543
 ] 

Gergo Repas commented on YARN-7913:
-----------------------------------

[~oshevchenko] I see, that's a good point that ACL changes may cause similar 
issues.

In this ticket my proposal would be to improve error-handling in general. I'd 
propose to try to recover all applications in a passive-to-active RM 
transition-attempt even with the presence of exceptions (logging all 
exceptions, rethrowing the last one), and the async event processing would take 
care of putting the application attempt to rejected state. This way we would 
have only a couple of passive-to-active RM transition-attempts (depending on 
how long the async event processing takes), instead of N+1 passive-to-active 
attempts, where N is the number of rejected applications. This is to improve 
the handling of any unforeseen exceptions.

What do you think?

> Improve error handling when application recovery fails with exception
> ---------------------------------------------------------------------
>
>                 Key: YARN-7913
>                 URL: https://issues.apache.org/jira/browse/YARN-7913
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 3.0.0
>            Reporter: Gergo Repas
>            Assignee: Gergo Repas
>            Priority: Major
>         Attachments: YARN-7913.000.poc.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
>  * setup: a queue is a leaf queue in the primary RM's config and the same 
> queue is a parent queue in the secondary RM's config.
>  * When failover happens with this setup, the recovery will fail for 
> applications on this queue, and an APP_REJECTED event will be dispatched to 
> the async dispatcher. On the same thread (that handles the recovery), a 
> NullPointerException is thrown when the applicationAttempt is tried to be 
> recovered 
> (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
>  I don't see a good way to avoid the NPE in this scenario, because when the 
> NPE occurs the APP_REJECTED has not been processed yet, and we don't know 
> that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X 
> applications, there will be ~X passive -> active RM transition attempts - the 
> passive -> active RM transition will only succeed when the last APP_REJECTED 
> event is processed on the async dispatcher thread.
> _The point of this ticket is to improve the error handling and reduce the 
> number of passive -> active RM transition attempts (solving the above 
> described failure scenario isn't in scope)._



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to