[
https://issues.apache.org/jira/browse/YARN-9863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949769#comment-16949769
]
David Mollitor commented on YARN-9863:
--------------------------------------
[~szegedim] Thank you for your feedback.
The background here is that I am working with a large cluster that has one job
in particular that is crushing it. This one job must localize many
resources, of varying file sizes, before it can run. As I understand
YARN, when a job is submitted to the cluster, a list of files to localize is
sent to each NodeManager involved in the job. In this case, all nodes are
involved. All NodeManagers receive a carbon copy of the list of files from the
ResourceManager (or maybe it's the 'yarn' client?). That is, they all have the
same list, with the same ordering. Each NodeManager then iterates through the
list and requests that each file be localized.
So, it would seem to me that all of the NodeManagers would request from HDFS
file1, file2, file3, ...
This would have a stampeding effect on the HDFS DataNodes.
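To illustrate, here is a minimal sketch (not the actual NodeManager code path; the class name and local directory are placeholders) of what the ordered localization loop amounts to:
{code:java}
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Rough sketch of what every NodeManager effectively does today: walk the
// *same* ordered list. With N NodeManagers, all N fetch file1 first, then
// file2, and so on, so the few DataNodes holding file1's replicas are hit
// by every node at once.
public class OrderedLocalizationSketch {
  static void localizeAll(List<Path> resources, Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    for (Path p : resources) { // identical order on every NodeManager
      fs.copyToLocalFile(p, new Path("/tmp/localized/" + p.getName()));
    }
  }
}
{code}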
I am familiar with {{mapreduce.client.submit.file.replication}}. I understand
that this is used to pump up the replication of the submitted files so that
they are available on more DataNodes. However, the way that it works, as I
understand it, is that the file is first written to the HDFS cluster with the
default replication (usually 3), and then the client requests that the file be
replicated up to the final replication factor in a separate request (setrep). This
replication process happens asynchronously. If the
{{mapreduce.client.submit.file.replication}} is set to 10, for example, the job
may be submitted and finished before the file actually achieves a final
replication of 10. This is exacerbated on larger clusters. If a cluster
has 1,000 nodes, the recommended value of
{{mapreduce.client.submit.file.replication}} is sqrt(1000) or ~32. The default
number of connections each DataNode can support is 10
({{dfs.datanode.handler.count}}). So, even once the desired replication is
achieved, that is 32 DataNodes x 10 handlers = 320 connections served at once.
In a cluster with 1,000 nodes all requesting the same file, that is going to stall.
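For illustration, here is a sketch of that two-step behavior using the plain FileSystem API (the staging path and class name are made up; this is not the actual JobSubmitter code):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SubmitReplicationSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path jobJar = new Path("/user/submitter/job.jar"); // hypothetical staging path

    // Step 1: the file lands with the default replication (usually 3).
    try (FSDataOutputStream out = fs.create(jobJar)) {
      out.writeBytes("job resource contents");
    }

    // Step 2: equivalent of mapreduce.client.submit.file.replication = 10.
    // setReplication only records the new target; the NameNode schedules the
    // extra copies asynchronously, so the job may start (and even finish)
    // before 10 replicas actually exist.
    fs.setReplication(jobJar, (short) 10);
  }
}
{code}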
By simply randomizing the list, the load can be spread across many different
sets of 32 DataNodes, which better supports this scenario.
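A minimal sketch of the idea (the actual change would live in {{LocalResourceBuilder}}; the helper below is just a placeholder):
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ShuffleResourcesSketch {
  // Each NodeManager shuffles its own copy of the resource list before
  // fetching, so different nodes start on different files and the read load
  // fans out across many sets of replicas instead of one.
  static <T> List<T> shuffled(List<T> resources) {
    List<T> copy = new ArrayList<>(resources);
    Collections.shuffle(copy, new Random()); // plain Random is enough; see point 2 below
    return copy;
  }
}
{code}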
For your questions:
# I'm not sure how HDFS would manage this. The requests are generated by the
NodeManagers and the HDFS cluster is simply serving them; HDFS has no way to
randomize the order of incoming requests.
# Regarding SecureRandom: this is not a security-sensitive operation. It only
requires a fast, reasonably good randomization of the list to spread the load.
# I believe that the parallelism of localization is configurable with
{{yarn.nodemanager.localizer.fetch.thread-count}} (default 4), but the requests
are submitted to a work-queue in order, so there will still be some level of
trampling, especially if there are more than 4 files to localize (as is the
case with the scenario I am reviewing). See the sketch after this list.
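Here is a rough sketch of point 3 (class and method names are placeholders, not the real localizer classes): with a fixed pool of 4 fetch threads draining a FIFO queue, every node still starts on the first 4 entries of the identical list.
{code:java}
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OrderedFetchQueueSketch {
  static void fetchAll(List<Runnable> fetchTasks) {
    // roughly yarn.nodemanager.localizer.fetch.thread-count = 4
    ExecutorService pool = Executors.newFixedThreadPool(4);
    for (Runnable task : fetchTasks) {
      pool.submit(task); // submitted in list order -> FIFO work queue
    }
    pool.shutdown();
  }
}
{code}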
> Randomize List of Resources to Localize
> ---------------------------------------
>
> Key: YARN-9863
> URL: https://issues.apache.org/jira/browse/YARN-9863
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager
> Reporter: David Mollitor
> Assignee: David Mollitor
> Priority: Minor
> Attachments: YARN-9863.1.patch, YARN-9863.2.patch
>
>
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapreduce/v2/util/LocalResourceBuilder.java
> Add a new parameter to {{LocalResourceBuilder}} that allows the list of
> resources to be shuffled randomly. This will allow the Localizer to spread
> the load of requests so that not all of the NodeManagers are requesting to
> localize the same files, in the same order, from the same DataNodes.