[ 
https://issues.apache.org/jira/browse/YARN-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113041#comment-15113041
 ] 

Carlo Curino commented on YARN-2885:
------------------------------------

[~asuresh], I skimmed the patch briefly, focusing on a couple of issues: 1) 
visibility/security of the extra information about the cluster state, 2) the 
LocalScheduler algos.

Sorry if I ask stupid questions: I haven't been following this closely, and I 
haven't spent long looking at this code.

I like the idea of the {{DistributedSchedulingProtocol}} as a specialization of 
the {{ApplicationMasterProtocol}}. One thing that would make it even stronger 
is to enforce the visibility of (and access to) the extra information about 
cluster state by means of tokens. This would allow you to say: every 
application in the cluster has the AMRM token, but only the AMRMProxy can add a 
special "DSP-Token" that grants visibility of the cluster state (be it top-k or 
whatever extra info the DSP sends down the pipe).
Moreover, this would allow trusted and smart applications to also receive this 
information if the RM decides to grant them this privilege. This could be great 
for any AM that has the smarts to determine where it wants to run based on 
cluster load, etc.
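
To make the token idea concrete, here is a minimal sketch in plain Java. It is 
not taken from the patch and does not use any YARN classes: 
{{ClusterStateVisibility}}, {{TokenKind}}, {{AllocateResponse}} and 
{{filter()}} are all hypothetical names, and the "DSP" token kind is only the 
suggestion made above. The assumption is that the RM side strips the extra 
cluster-state view from a response unless the caller presented the DSP token.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of token-gated cluster-state visibility; not YARN code. */
final class ClusterStateVisibility {

  /** Kinds of token a caller may present (illustrative names only). */
  public enum TokenKind { AMRM, DSP }

  /** Stand-in for an allocate response; topKNodes is the "extra" cluster state. */
  public static final class AllocateResponse {
    public final List<String> allocatedContainers;
    public final List<String> topKNodes;

    public AllocateResponse(List<String> allocatedContainers, List<String> topKNodes) {
      // Defensive copies so a filtered response cannot be altered through the original lists.
      this.allocatedContainers =
          Collections.unmodifiableList(new ArrayList<>(allocatedContainers));
      this.topKNodes = Collections.unmodifiableList(new ArrayList<>(topKNodes));
    }
  }

  /** Returns the full response only to callers holding the DSP token. */
  public static AllocateResponse filter(AllocateResponse full, Set<TokenKind> presented) {
    if (presented.contains(TokenKind.DSP)) {
      return full;
    }
    // Plain AMRM-token holders still get their containers, but no cluster-state view.
    return new AllocateResponse(full.allocatedContainers, Collections.<String>emptyList());
  }

  private ClusterStateVisibility() {}
}
```

With this shape, granting a trusted AM the extra information amounts to the RM 
issuing it the second token; nothing else in the protocol needs to change.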

(I am OK if this is done in a follow-up JIRA, especially given that you guys 
are working on a branch.)

I started to look at the LocalScheduler code. I think I need some more comments 
to follow along. 

Minor in LocalScheduler (and surrounding classes):
 * The {{DistSchedulerParams}} hard-codes the assumption that resources are 
only mem/cpu. As work is ongoing to make that more general, I suggest using the 
Resource construct.
 * In updateResourceAsk(), it is a bit confusing that "requests" is used as the 
name of both the input param and the global variable. Can you change that? 
Also, having updateResourceAsk return a list instead of having side effects 
might help.
 * In {{OpportunisticContainerAllocator}}, why are you "resizing" containers? 
If the app is asking for an inadmissible container, I don't think it is correct 
to lower its ask to the largest acceptable container (maybe rephrasing it as a 
question: is this what the RM does?). 
   Also, this math is done on mem and cpu as integers, instead of on Resources 
(see above).
 * You use HashMap<Resource, ResourceRequest>, but I don't see the reason for 
it, as you seem to scan the entire set anyway.
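
To illustrate the points above, here is a minimal sketch in plain Java. It is 
not the patch's code: {{Res}}, {{Ask}}, {{InadmissibleAskException}} and this 
signature for {{updateResourceAsk}} are hypothetical stand-ins. It assumes an 
oversized ask is rejected rather than resized (whether that matches the RM is 
exactly the open question above), compares resources as a whole instead of as 
separate mem/cpu integers, and returns a new list rather than mutating a field.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; the types below are simplified stand-ins, not YARN classes. */
final class ResourceAskSketch {

  /** Stand-in for a Resource: callers compare whole resources, not raw fields. */
  public static final class Res {
    public final long memoryMb;
    public final int vcores;

    public Res(long memoryMb, int vcores) {
      this.memoryMb = memoryMb;
      this.vcores = vcores;
    }

    /** True when every dimension of this resource fits within max. */
    public boolean fitsIn(Res max) {
      return memoryMb <= max.memoryMb && vcores <= max.vcores;
    }
  }

  /** Stand-in for a resource request. */
  public static final class Ask {
    public final String name;
    public final Res capability;

    public Ask(String name, Res capability) {
      this.name = name;
      this.capability = capability;
    }
  }

  /** Raised instead of silently shrinking an oversized ask (an assumption of this sketch). */
  public static final class InadmissibleAskException extends RuntimeException {
    public InadmissibleAskException(String message) {
      super(message);
    }
  }

  /** Returns the admissible asks as a new list; the input list is left untouched. */
  public static List<Ask> updateResourceAsk(List<Ask> incoming, Res maxAllocation) {
    List<Ask> accepted = new ArrayList<>();
    for (Ask ask : incoming) {
      if (!ask.capability.fitsIn(maxAllocation)) {
        throw new InadmissibleAskException("ask " + ask.name + " exceeds the maximum allocation");
      }
      accepted.add(ask);
    }
    return accepted;
  }

  private ResourceAskSketch() {}
}
```

Because the dimension-by-dimension comparison lives only in {{fitsIn()}}, 
adding a new resource type later would touch that one method and not the 
allocator logic.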
 

> Create AMRMProxy request interceptor for distributed scheduling decisions for 
> queueable containers
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-2885
>                 URL: https://issues.apache.org/jira/browse/YARN-2885
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager, resourcemanager
>            Reporter: Konstantinos Karanasos
>            Assignee: Arun Suresh
>         Attachments: YARN-2885-yarn-2877.001.patch, 
> YARN-2885-yarn-2877.002.patch, YARN-2885-yarn-2877.full-2.patch, 
> YARN-2885-yarn-2877.full-3.patch, YARN-2885-yarn-2877.full.patch, 
> YARN-2885-yarn-2877.v4.patch, YARN-2885-yarn-2877.v5.patch, 
> YARN-2885-yarn-2877.v6.patch, YARN-2885_api_changes.patch
>
>
> We propose to add a Local ResourceManager (LocalRM) to the NM in order to 
> support distributed scheduling decisions. 
> Architecturally we leverage the RMProxy, introduced in YARN-2884. 
> The LocalRM makes distributed decisions for queuable containers requests. 
> Guaranteed-start requests are still handled by the central RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
