[ https://issues.apache.org/jira/browse/YARN-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335017#comment-17335017 ]
Jim Brennan commented on YARN-10738:
------------------------------------
[~zhuqi], I am not very familiar with the multi-threaded scheduling code - we
have not started using it yet. So it would be very helpful if you could
provide more details about what you are observing in your cluster, and how you
think this will fix it. Is your cluster made up of many nodes that are the
same size, or do you have a mix of different sizes? If you have any data that
shows some nodes being more heavily utilized than others, that would be helpful.

Looking at {{ResourceUsageMultiNodeLookupPolicy}}, it seems to sort nodes by
their allocated resources, so it appears to be trying to spread allocations
more evenly across nodes. It doesn't consider the relative sizes of the nodes
though, so in a heterogeneous cluster, I could see it leading to smaller nodes
being busier than larger nodes. I wonder if a reverse sort by unallocated
resources might be more fair, because it would favor nodes that have more room
for new resource requests, rather than those that currently have fewer
resources allocated.
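
To make the difference between the two orderings concrete, here is a minimal
Java sketch. It does not use any YARN classes: {{Node}}, its fields, and both
comparators are hypothetical stand-ins, and the memory figures are made up for
illustration. It only shows that, on a cluster with one small and one large
node, sorting by allocated resources and reverse-sorting by unallocated
resources can pick different nodes first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class NodeOrderingSketch {
    /** Hypothetical stand-in for a scheduler node; not a YARN class. */
    static final class Node {
        final String name;
        final long totalMb;
        final long allocatedMb;

        Node(String name, long totalMb, long allocatedMb) {
            this.name = name;
            this.totalMb = totalMb;
            this.allocatedMb = allocatedMb;
        }

        long unallocatedMb() {
            return totalMb - allocatedMb;
        }
    }

    // Least-allocated node first (assumed reading of the existing policy).
    static final Comparator<Node> BY_ALLOCATED_ASC =
        Comparator.comparingLong((Node n) -> n.allocatedMb);

    // Node with the most free room first (the alternative suggested above).
    static final Comparator<Node> BY_UNALLOCATED_DESC =
        Comparator.comparingLong((Node n) -> n.unallocatedMb()).reversed();

    /** Returns node names in the order given by cmp, leaving the input untouched. */
    static List<String> order(List<Node> nodes, Comparator<Node> cmp) {
        List<Node> copy = new ArrayList<>(nodes);
        copy.sort(cmp);
        List<String> names = new ArrayList<>();
        for (Node n : copy) {
            names.add(n.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>();
        nodes.add(new Node("small", 16_384, 4_096));   // 25% allocated, 12 GB free
        nodes.add(new Node("large", 131_072, 32_768)); // 25% allocated, 96 GB free

        System.out.println(order(nodes, BY_ALLOCATED_ASC));    // [small, large]
        System.out.println(order(nodes, BY_UNALLOCATED_DESC)); // [large, small]
    }
}
```

Both nodes are equally full in relative terms, yet the allocated-resources
ordering keeps preferring the small node until its absolute allocation catches
up with the large one.
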
Another option to consider would be to have a policy that uses node
utilization, which should more accurately reflect how busy the node is.

With respect to the policy proposed in this ticket, I am not convinced it will
help very much. It's doing the same sort by allocated resources, but just
adding a shuffle of every 10 nodes. I'm not sure how much that will help in
practice on a large cluster. A rack is usually more than 10 nodes, so it's
possible the same set of racks will be over-utilized. Again, it would be
helpful if you had some before/after data to show how it helps in a real
cluster.
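
The concern about the shuffle interval can be sketched as follows. This is
only one possible reading of "a shuffle of every 10 nodes" (shuffling inside
consecutive fixed-size windows of an already sorted list); the actual patch may
work differently. {{shuffleInWindows}} and the integer node ids are
hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class WindowedShuffleSketch {
    /**
     * Returns a copy of the sorted list in which each consecutive block of
     * `window` elements is shuffled independently. Elements never leave
     * their own block.
     */
    static <T> List<T> shuffleInWindows(List<T> sorted, int window, Random rnd) {
        if (window <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        List<T> out = new ArrayList<>(sorted);
        for (int start = 0; start < out.size(); start += window) {
            int end = Math.min(start + window, out.size());
            // subList is a view, so shuffling it reorders `out` in place.
            Collections.shuffle(out.subList(start, end), rnd);
        }
        return out;
    }

    public static void main(String[] args) {
        // Node ids 0..39, assumed to be already sorted by allocated resources.
        List<Integer> sortedIds = new ArrayList<>();
        for (int i = 0; i < 40; i++) {
            sortedIds.add(i);
        }
        List<Integer> shuffled = shuffleInWindows(sortedIds, 10, new Random());
        // The first window is always some permutation of ids 0..9.
        System.out.println(shuffled.subList(0, 10));
    }
}
```

Under this reading, the first ten candidates are always the same ten
least-allocated nodes, only in a different order, which is why the same racks
could still end up being preferred.
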
> When multi thread scheduling with multi node, we should shuffle with a gap to
> prevent hot accessing nodes.
> ----------------------------------------------------------------------------------------------------------
>
> Key: YARN-10738
> URL: https://issues.apache.org/jira/browse/YARN-10738
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Qi Zhu
> Assignee: Qi Zhu
> Priority: Major
> Labels: pull-request-available
> Time Spent: 50m
> Remaining Estimate: 0h
>
> The current multi-threaded scheduling with multi node is not reasonable.
> In large clusters, it will cause hot-accessed nodes, which will lead to
> abnormal boom nodes.
> Solution:
> I think we should shuffle the sorted nodes (such as under the available
> resource sort policy) with an interval.
> It will solve the above problem and avoid hot-accessed nodes.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)