Junping Du commented on YARN-4354:

bq. I don't think there's anything magical about localization vs. the other 
things the NM is doing. The async dispatcher will only exit if an exception 
leaks up to the top, and when it does that's a programming error since it 
doesn't handle an exception properly.
I agree there is not much difference overall. However, back to this case: 
from a user's perspective, an occasional NPE during localization of a 
resource that is being cancelled would be better ignored (but logged) than 
allowed to crash the NM. The price of ignoring the exception here could be a 
leaked, half-localized file (which could be removed later), but the gain is 
that the NM survives and keeps working. We should at least offer this 
trade-off to the user as a configurable choice, shouldn't we?
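
To make the trade-off concrete, here is a minimal sketch of a dispatcher 
whose reaction to an unexpected handler exception is a configurable choice. 
Everything in it is hypothetical and for illustration only: the class 
PolicyDispatcher, the OnError enum and the errors() accessor are invented 
names, not the Hadoop AsyncDispatcher API, and the "stopped" flag merely 
stands in for taking the NM down.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch only; this is NOT the Hadoop AsyncDispatcher. */
class PolicyDispatcher<E> {

    /** Configurable reaction to an exception that leaks out of a handler. */
    enum OnError { EXIT, LOG_AND_CONTINUE }

    private final OnError policy;
    private final Consumer<E> handler;
    private final List<String> errors = new ArrayList<>();
    private boolean stopped = false;

    PolicyDispatcher(OnError policy, Consumer<E> handler) {
        this.policy = policy;
        this.handler = handler;
    }

    /** Handles one event; returns true if the dispatcher is still running. */
    boolean dispatch(E event) {
        if (stopped) {
            return false; // already shut down, the event is dropped
        }
        try {
            handler.accept(event);
        } catch (RuntimeException e) {
            // Always record the failure so it is never silently lost.
            errors.add(event + ": " + e);
            if (policy == OnError.EXIT) {
                stopped = true; // stand-in for crashing the NM
            }
            // LOG_AND_CONTINUE: keep running; any half-finished work
            // (e.g. a partially localized file) must be cleaned up later.
        }
        return !stopped;
    }

    boolean isStopped() {
        return stopped;
    }

    List<String> errors() {
        return errors;
    }
}
```

With LOG_AND_CONTINUE, a handler that throws an NPE for a cancelled 
resource leaves the dispatcher running and one entry in errors(); with EXIT, 
the same event stops it, which corresponds to the consistent crash-on-error 
behaviour argued for in the quoted text.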

bq.  If we're willing for NPEs in localization to not take down the NM, why are 
we willing to do the same if it happens in another NM subsystem that also uses 
the AsyncDispatcher? IMHO we should be consistent about the unexpected 
exception handling.
I am not against keeping localization event handling consistent with other 
subsystems, but I am not sure whether ignoring other exceptional events could 
leave the NM in a bad state. I think that is the motivation for separating 
SchedulerEventDispatcher from the RM dispatcher for general events, with 
different settings/behavior. No?

> Public resource localization fails with NPE
> -------------------------------------------
>                 Key: YARN-4354
>                 URL: https://issues.apache.org/jira/browse/YARN-4354
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.2
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Blocker
>             Fix For: 2.7.2
>         Attachments: YARN-4354-branch-2.7.002.patch, 
> YARN-4354-unittest.patch, YARN-4354.001.patch, YARN-4354.002.patch
> I saw public localization on nodemanagers get stuck because it was constantly 
> rejecting requests to the thread pool executor.
