[ https://issues.apache.org/jira/browse/YARN-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15006883#comment-15006883 ]
Junping Du commented on YARN-4354:
----------------------------------

bq. I don't think there's anything magical about localization vs. the other things the NM is doing. The async dispatcher will only exit if an exception leaks up to the top, and when it does that's a programming error since it doesn't handle an exception properly.

I agree there is not much difference overall. However, back to this case: from a user's perspective, an occasional NPE during localization of a resource that is being cancelled is better ignored (but logged) than allowed to crash the NM. The price of ignoring the exception here is a potentially leaked, half-localized file (which could be removed later), but the gain is that the NM survives and keeps working. We should at least offer this trade-off to the user as a configurable choice, shouldn't we?

bq. If we're willing for NPEs in localization to not take down the NM, why are we willing to do the same if it happens in another NM subsystem that also uses the AsyncDispatcher? IMHO we should be consistent about the unexpected exception handling.

I am not against keeping localization event handling consistent with the other subsystems, but I am not sure whether ignoring other exceptional events could leave the NM in a bad state. I think that is the motivation for separating SchedulerEventDispatcher from the RM dispatcher for general events, with different settings/behavior. No?
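To make the trade-off concrete, here is a minimal single-threaded sketch of the two policies. It is not the actual AsyncDispatcher code: the class name DispatcherSketch, the exitOnError flag, and the sample handler are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical illustration only; this is NOT Hadoop's AsyncDispatcher. */
class DispatcherSketch {
    /** Assumed policy switch: true = stop on an unhandled exception, false = log and continue. */
    private final boolean exitOnError;
    private final List<String> log = new ArrayList<>();
    private boolean stopped = false;
    private int handled = 0;

    DispatcherSketch(boolean exitOnError) {
        this.exitOnError = exitOnError;
    }

    /** Dispatches events in order and returns how many were handled without error. */
    int dispatchAll(List<String> events, Consumer<String> handler) {
        for (String event : events) {
            if (stopped) {
                break; // a stopped dispatcher processes nothing further
            }
            try {
                handler.accept(event);
                handled++;
            } catch (RuntimeException e) {
                // The failure is always recorded; the policy decides what happens next.
                log.add("error handling " + event + ": " + e);
                if (exitOnError) {
                    stopped = true;
                }
            }
        }
        return handled;
    }

    boolean isStopped() {
        return stopped;
    }

    List<String> getLog() {
        return log;
    }

    /** Hypothetical handler that fails with an NPE for a cancelled resource. */
    static void handle(String event) {
        if (event.equals("cancelled")) {
            throw new NullPointerException("resource was cancelled");
        }
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList("a", "cancelled", "b");
        for (boolean exit : new boolean[] {true, false}) {
            DispatcherSketch d = new DispatcherSketch(exit);
            int ok = d.dispatchAll(events, DispatcherSketch::handle);
            System.out.println("exitOnError=" + exit + " handled=" + ok
                + " stopped=" + d.isStopped() + " logged=" + d.getLog().size());
        }
    }
}
```

With the flag set, the first unhandled exception stops all further event processing; with it cleared, the failed event is logged and later events are still handled, at the cost of whatever state the failed handler left behind.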
> Public resource localization fails with NPE
> -------------------------------------------
>
>                 Key: YARN-4354
>                 URL: https://issues.apache.org/jira/browse/YARN-4354
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.2
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Blocker
>             Fix For: 2.7.2
>
>         Attachments: YARN-4354-branch-2.7.002.patch, YARN-4354-unittest.patch, YARN-4354.001.patch, YARN-4354.002.patch
>
>
> I saw public localization on nodemanagers get stuck because it was constantly rejecting requests to the thread pool executor.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)