[ 
https://issues.apache.org/jira/browse/YARN-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun Suresh updated YARN-2888:
------------------------------
    Attachment: YARN-2888.006.patch

Thank you [~curino] for the detailed review, and apologies for the delay in 
addressing your comments.

bq. ..I see you add a few extra conf parameters. I was wondering whether we can 
come up with a better mechanism to configure policies, other than global conf 
parameters...
If I understand your proposal correctly, you are talking about making the 
configuration itself 'polymorphic' with respect to the policy. If so, I 
totally agree with you, but I think that deserves its own (umbrella?) JIRA to 
do it proper justice. Thoughts?

W.r.t. making the parameters passed down in the {{NodeHeartBeatResponse}} more 
general than "queueLimit", and renaming {{ContainerQueuingLimit}} to 
{{ContainerQueueingCommand}}:
Again, great suggestion, but given that we currently have only a single 
'command' being passed down, I did not want to increase the complexity of the 
patch. If you are fine with it, I can raise a separate JIRA once we have at 
least one other command that needs to be passed down from the RM.

bq. in the .proto it would likely help other devs if you say 
max_wait_time_in_ms or something like that, which indicates time granularity. 
Also is int32 always enough?
I've added the 'in_ms' suffix. The upper limit of int32 expressed in ms is 
roughly 24.8 days. Given that this feature is targeted at short-lived tasks, I 
feel we can keep it as int32.
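
For reference, the int32 bound above can be checked with a few lines of Java. This is only an illustrative sketch: the class and method names are made up for this comment and are not part of the patch.

```java
import java.util.concurrent.TimeUnit;

public class MaxWaitTimeRange {
    // Largest value a protobuf int32 field can carry (same range as a Java int).
    static final int MAX_INT32_MS = Integer.MAX_VALUE;

    // Whole days representable by the given number of milliseconds (truncated).
    static long wholeDays(long millis) {
        return TimeUnit.MILLISECONDS.toDays(millis);
    }

    // Same quantity as a fraction, to show how far past 24 days the limit is.
    static double fractionalDays(long millis) {
        return millis / (double) TimeUnit.DAYS.toMillis(1);
    }

    public static void main(String[] args) {
        System.out.println("whole days: " + wholeDays(MAX_INT32_MS));
        System.out.println("fractional days: " + fractionalDays(MAX_INT32_MS));
    }
}
```

A wait time anywhere near that bound would be far outside what a short-lived task should ever see, which is why the narrower type seems sufficient.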

bq. Is it reasonable to assume the caller of QueueLimitCalculator.update() will 
synchronize on topKNodes?
I believe so. The QueueLimitCalculator was designed to be a helper class of 
NodeManagerQueueMonitor, and only the NMQM can call update (since it is 
package private), which it does within a synchronized scope.
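
To make the locking assumption concrete, here is a minimal sketch of the pattern, not the code in the patch. All names ({{QueueMonitorSketch}}, {{LimitCalculatorSketch}}, {{reportQueueLength}}, {{recomputeLimit}}) are hypothetical, and the "limit" is simplified to the largest queue length seen.

```java
import java.util.HashMap;
import java.util.Map;

public class QueueMonitorSketch {

    // Hypothetical helper: it does no locking of its own and relies on its caller.
    static class LimitCalculatorSketch {
        private int maxQueueLength;

        // Package-private, so only classes in the same package can call it.
        void update(Map<String, Integer> queueLengthByNode) {
            int max = 0;
            for (int length : queueLengthByNode.values()) {
                max = Math.max(max, length);
            }
            maxQueueLength = max;
        }

        int getMaxQueueLength() {
            return maxQueueLength;
        }
    }

    private final Map<String, Integer> queueLengthByNode = new HashMap<>();
    private final LimitCalculatorSketch calculator = new LimitCalculatorSketch();

    // Writers take the lock on the shared map before mutating it.
    public void reportQueueLength(String nodeId, int length) {
        synchronized (queueLengthByNode) {
            queueLengthByNode.put(nodeId, length);
        }
    }

    // The only call site of update(); it holds the same lock as the writers.
    public int recomputeLimit() {
        synchronized (queueLengthByNode) {
            calculator.update(queueLengthByNode);
            return calculator.getMaxQueueLength();
        }
    }
}
```

The helper stays free of locking logic because its single caller owns the lock; the trade-off is that any new caller would have to follow the same discipline.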

bq. If topKNodes is << than total nodes, you could create a local list...
I apologize for the confusion: the Calculator actually has to go through ALL 
the nodes, not just the top K. I have fixed this in the latest patch.

With regard to the MEDIAN metric, I initially included it since it is less 
susceptible to major variations from outliers, but since we have a max and a 
min, I don't think it is required. I have updated the patch to remove median.
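
As a small illustration of the trade-off (hypothetical names and made-up queue lengths, not code from the patch), the sketch below scans every node's queue length and shows that a single outlier moves the max a lot while the median stays put:

```java
import java.util.Arrays;

public class QueueStatsSketch {
    // Smallest queue length across ALL nodes, not just a top-K subset.
    static int min(int[] queueLengths) {
        return Arrays.stream(queueLengths).min().getAsInt();
    }

    // Largest queue length across all nodes.
    static int max(int[] queueLengths) {
        return Arrays.stream(queueLengths).max().getAsInt();
    }

    // Median of the queue lengths; sorts a copy so the input is left untouched.
    static double median(int[] queueLengths) {
        int[] sorted = queueLengths.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        int[] balanced = {2, 3, 3, 4, 5};     // assumed sample values
        int[] withOutlier = {2, 3, 3, 4, 50}; // one overloaded node
        for (int[] sample : new int[][] {balanced, withOutlier}) {
            System.out.println("min=" + min(sample)
                + " max=" + max(sample)
                + " median=" + median(sample));
        }
    }
}
```

Since max and min already bound the range and expose the overloaded node directly, dropping the median loses little.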

I have updated the patch with the rest of your suggestions.

Do take a look at the latest patch, and let me know if you are fine with the 
changes.

> Corrective mechanisms for rebalancing NM container queues
> ---------------------------------------------------------
>
>                 Key: YARN-2888
>                 URL: https://issues.apache.org/jira/browse/YARN-2888
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager, resourcemanager
>            Reporter: Konstantinos Karanasos
>            Assignee: Arun Suresh
>         Attachments: YARN-2888-yarn-2877.001.patch, 
> YARN-2888-yarn-2877.002.patch, YARN-2888.003.patch, YARN-2888.004.patch, 
> YARN-2888.005.patch, YARN-2888.006.patch
>
>
> Bad queuing decisions by the LocalRMs (e.g., due to the distributed nature of 
> the scheduling decisions or due to having a stale image of the system) may 
> lead to an imbalance in the waiting times of the NM container queues. This 
> can in turn have an impact on job execution times and cluster utilization.
> To this end, we introduce corrective mechanisms that may remove (whenever 
> needed) container requests from overloaded queues, adding them to less-loaded 
> ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
