[
https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015009#comment-17015009
]
Szilard Nemeth commented on YARN-7913:
--------------------------------------
Hi [~wilfreds],
Has read through all comments and checked your patch.
1. IIUC, your approach is a new one compared to Gergo's previous PoC patch,
right?
2. If I was just checking the description: "The point of this ticket is to
improve the error handling and reduce the number of passive -> active RM
transition attempts (solving the above described failure scenario isn't in
scope).", it was not in-line with the patch you uploaded.
I think it would be better to remove this statement or add something to the
desription that describes what has been done recently with your patch. This
way, we could make a better experience of future readers of this jira.
What do you think?
3. FairScheduler#addApplication is invoked by processing the
SchedulerEventType#APP_ADDED event. This event type is used by
AppAddedSchedulerEvent.
The calls to constructors of this event class from RMAppImpl are happening when
an application is added or recovered. I can see the following calls in
RmAppImpl:
{code}
1.
app.handler.handle(
new AppAddedSchedulerEvent(app.user, app.submissionContext, false,
app.applicationPriority, app.placementContext));
2. app.scheduler.handle(
new AppAddedSchedulerEvent(app.user, app.submissionContext, false,
app.applicationPriority, app.placementContext));
3.
// Add application to scheduler synchronously to guarantee scheduler
// knows applications before AM or NM re-registers.
app.scheduler.handle(
new AppAddedSchedulerEvent(app.user, app.submissionContext, true,
app.applicationPriority, app.placementContext));
{code}
The boolean parameter is set to isAppRecovering field of the event, so this is
why I came to the conclusion that apps being created or recovered are both sent
to the event dispatcher.
So based on the above statement, the code you added to
FairScheduler#addApplication seems a bit incorrect:
{code}
// app is recovering we do not want to fail the app now as it was there
// before we started the recovery. Add it to the recovery queue:
// dynamic queue directly under root, no ACL needed (auto clean up)
queueName = "root.recovery";
queue = queueMgr.getLeafQueue(queueName, true, applicationId);
{code}
Shouldn't we need to check the isAppRecovering flag and only run this code
block if the flag is true, meaning the app is recovering?
4. Nit: The comment has a missing comma at least:
{code}
// Skip ACL check for recovering applications: they have been accepted
// in the queue already recovery should not break that.
{code}
5. One thing I don't fully understand:
{code}
// when recovering the NMs might not have registered and we could have
// no resources in the queue, the app is already running and has thus
// passed all these checks, skip them now.
if (!isAppRecovering && rmApp != null &&
rmApp.getAMResourceRequests() != null) {
....
{code}
Is this condition for checking the apps that are NOT recovering?
6. I can imagine some improvement that could be filed as a follow-up jira: I
was having a bit hard time to have an understanding of the conditions inside
FairScheduler#addApplication.
For example, the statements inside the if-else conditions could be extracted to
methods with meaningful names, so code readers would understand what happens,
right away by reading the code. If they are interested in details of a
particular case, they could open the method anyway. Can you file a jira for
this?
thanks!
> Improve error handling when application recovery fails with exception
> ---------------------------------------------------------------------
>
> Key: YARN-7913
> URL: https://issues.apache.org/jira/browse/YARN-7913
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Affects Versions: 3.0.0
> Reporter: Gergo Repas
> Assignee: Wilfred Spiegelenburg
> Priority: Major
> Attachments: YARN-7913.000.poc.patch, YARN-7913.001.patch,
> YARN-7913.002.patch, YARN-7913.003.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
> * setup: a queue is a leaf queue in the primary RM's config and the same
> queue is a parent queue in the secondary RM's config.
> * When failover happens with this setup, the recovery will fail for
> applications on this queue, and an APP_REJECTED event will be dispatched to
> the async dispatcher. On the same thread (that handles the recovery), a
> NullPointerException is thrown when the applicationAttempt is tried to be
> recovered
> (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
> I don't see a good way to avoid the NPE in this scenario, because when the
> NPE occurs the APP_REJECTED has not been processed yet, and we don't know
> that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X
> applications, there will be ~X passive -> active RM transition attempts - the
> passive -> active RM transition will only succeed when the last APP_REJECTED
> event is processed on the async dispatcher thread.
> _The point of this ticket is to improve the error handling and reduce the
> number of passive -> active RM transition attempts (solving the above
> described failure scenario isn't in scope)._
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]