Gergo Repas created YARN-7913:

             Summary: Improve error handling when application recovery fails with exception
                 Key: YARN-7913
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: resourcemanager
    Affects Versions: 3.0.0
            Reporter: Gergo Repas
            Assignee: Gergo Repas

There are edge cases in which application recovery fails with an exception.

Example failure scenario:
 * setup: a queue is a leaf queue in the primary RM's config and the same queue 
is a parent queue in the secondary RM's config.
 * When failover happens with this setup, the recovery will fail for 
applications on this queue, and an APP_REJECTED event will be dispatched to the 
async dispatcher. On the same thread (that handles the recovery), a 
NullPointerException is thrown when the applicationAttempt is tried to be 
 I don't see a good way to avoid the NPE in this scenario: when the NPE 
occurs, the APP_REJECTED event has not been processed yet, so we do not yet 
know that the application recovery failed.

Currently, the first exception aborts the recovery. If there are X such 
applications, there will be ~X passive -> active RM transition attempts, 
because the passive -> active RM transition only succeeds once the last 
APP_REJECTED event has been processed on the async dispatcher thread.

_The point of this ticket is to improve the error handling and reduce the 
number of passive -> active RM transition attempts (solving the above described 
failure scenario isn't in scope)._
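
To make the difference concrete, here is a minimal, self-contained sketch. It is not the ResourceManager code and not the proposed patch: the class RecoverySketch, its methods, and the "broken" set are hypothetical names used only for illustration. The first method models the behaviour described above under a simplifying assumption (each aborted attempt results in exactly one application being rejected before the next attempt); the second shows one possible way to keep recovering the remaining applications and collect the failures in a single pass.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RecoverySketch {

    // Hypothetical stand-in for recovering a single application.
    // Applications listed in `broken` cannot be recovered and throw.
    static void recoverOne(String appId, Set<String> broken) {
        if (broken.contains(appId)) {
            throw new IllegalStateException("cannot recover " + appId);
        }
    }

    // Model of the current behaviour: every transition attempt aborts at the
    // first failing application. Assumption: that application has been
    // rejected by the time the next attempt starts, so it is skipped then.
    static int attemptsWhenAbortingOnFirstFailure(List<String> appIds,
                                                  Set<String> broken) {
        Set<String> rejected = new HashSet<>();
        int attempts = 0;
        while (true) {
            attempts++;
            boolean aborted = false;
            for (String appId : appIds) {
                if (rejected.contains(appId)) {
                    continue; // already rejected in an earlier attempt
                }
                try {
                    recoverOne(appId, broken);
                } catch (RuntimeException e) {
                    rejected.add(appId);
                    aborted = true;
                    break; // the first exception aborts this attempt
                }
            }
            if (!aborted) {
                return attempts;
            }
        }
    }

    // One possible alternative: catch the failure per application, remember
    // it, and continue with the remaining applications in the same pass.
    static List<String> recoverCollectingFailures(List<String> appIds,
                                                  Set<String> broken) {
        List<String> failed = new ArrayList<>();
        for (String appId : appIds) {
            try {
                recoverOne(appId, broken);
            } catch (RuntimeException e) {
                failed.add(appId);
            }
        }
        return failed;
    }
}
```

In this model, X unrecoverable applications cost X aborted attempts plus one successful attempt, whereas the collecting variant finishes in a single pass and returns the list of applications that still need to be rejected.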
