[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14732429#comment-14732429 ]
Karthik Kambatla commented on YARN-2005: ---------------------------------------- Thanks for working on this, Anubhav. The approach looks good to me. Minor comments/nits on the patch itself: # Spurious changes/imports in AbstractYarnScheduler, FifoScheduler, RMAppImpl, TestAbstractYarnScheduler, TestAMRestart, YarnScheduler. # In AppSchedulingInfo ## synchronize *only* the common updateBlacklist method? ## Also, should we just synchronize on the list in question and not all of AppSchedulingInfo? I am fine with leaving it as is, if that is a lot of work and won't yield any big gains. ## The comments in the common updateBlaclist method refer to userBlacklist while the method operates on both system and user blacklists. ## Should transferStateFromPreviousAppSchedulingInfo update the systemBlacklist as well? Or, is the decision here to have the system blacklist per app-attempt? # BlacklistAdditionsRemovals: ## Mark it Private ## Rename to BlacklistUpdates ## Rename members blacklistAdditions and blacklistRemovals to additions and removals, and update the getters accordingly? # BlackListManager ## Mark it Private ## Rename addNodeContainerFailure to addNode? ## Rename getter? # In DisabledBlacklistManager, define a static EMPTY_LIST similar to SimpleBlacklistManager and use that to avoid creating two ArrayLists for each AppAttempt. # RMAppImpl: If am blacklisting is not enabled, we don't need to read the disable threshold. # RMAppAttempt: Instead of getAMBlacklist, should we call it getSystemBlacklist to be consistent with the way we refer to it in the scheduler? # RMAppAttemptImpl ## EMPTY_SYSTEM_BLACKLIST is unused ## Update any variables based on the method name - getAMBlacklist vs getSystemBlacklist ## Shouldn't we blacklist nodes on LaunchFailedTransition - may be in a follow-up JIRA? Can we file one if you agree. # The MockRM change is unrelated? I like the change, but may be we should do it in a separate clean-up JIRA. It might have a few other things to clean-up :) # TestAMRestart: A couple of unused variables. # Why are the changes to TestCapacityScheduler needed? They don't look related to this patch. # TestRMAppLogAggregationStatus change is unrelated. > Blacklisting support for scheduling AMs > --------------------------------------- > > Key: YARN-2005 > URL: https://issues.apache.org/jira/browse/YARN-2005 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager > Affects Versions: 0.23.10, 2.4.0 > Reporter: Jason Lowe > Assignee: Anubhav Dhoot > Attachments: YARN-2005.001.patch, YARN-2005.002.patch, > YARN-2005.003.patch, YARN-2005.004.patch, YARN-2005.005.patch, > YARN-2005.006.patch, YARN-2005.006.patch, YARN-2005.007.patch, > YARN-2005.008.patch > > > It would be nice if the RM supported blacklisting a node for an AM launch > after the same node fails a configurable number of AM attempts. This would > be similar to the blacklisting support for scheduling task attempts in the > MapReduce AM but for scheduling AM attempts on the RM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)