[ https://issues.apache.org/jira/browse/YARN-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999584#comment-13999584 ]
Hong Zhiguo commented on YARN-2027:
-----------------------------------

I did this in YARN-1974 to specify the nodes on which containers should be allocated (for the fair and capacity schedulers), and it works both in unit tests and in our real cluster.

> YARN ignores host-specific resource requests
> --------------------------------------------
>
>                 Key: YARN-2027
>                 URL: https://issues.apache.org/jira/browse/YARN-2027
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager, scheduler
>    Affects Versions: 2.4.0
>         Environment: RHEL 6.1
>                      YARN 2.4
>            Reporter: Chris Riccomini
>
> YARN appears to be ignoring host-level ContainerRequests.
>
> I am creating a container request with code that pretty closely mirrors the DistributedShell code:
>
> {code}
> protected def requestContainers(memMb: Int, cpuCores: Int, containers: Int) {
>   info("Requesting %d container(s) with %dmb of memory" format (containers, memMb))
>   val capability = Records.newRecord(classOf[Resource])
>   val priority = Records.newRecord(classOf[Priority])
>   priority.setPriority(0)
>   capability.setMemory(memMb)
>   capability.setVirtualCores(cpuCores)
>   // Specifying a host in the String[] host parameter here seems to do
>   // nothing. Setting relaxLocality to false also doesn't help.
>   (0 until containers).foreach(idx =>
>     amClient.addContainerRequest(new ContainerRequest(capability, null, null, priority)))
> }
> {code}
>
> When I run this code with a specific host in the ContainerRequest, YARN does not honor the request. Instead, it puts the container on an arbitrary host. This appears to be true for both the FifoScheduler and the CapacityScheduler.
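> The host-pinned variant we tried is sketched below: pass explicit hosts and set relaxLocality to false, which is supposed to keep the scheduler from falling back to other hosts. (The hostname is a placeholder; capability, priority, and amClient are the same as in the snippet above.)
>
> {code}
> import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
>
> // Host-pinned request: explicit hosts, no rack asks, relaxLocality = false.
> val hosts = Array("eat1-app1234.example.com")
> (0 until containers).foreach(idx =>
>   amClient.addContainerRequest(
>     new ContainerRequest(capability, hosts, null, priority, false)))
> {code}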
> Currently, we are running the CapacityScheduler with the following settings:
>
> {noformat}
> <configuration>
>   <property>
>     <name>yarn.scheduler.capacity.maximum-applications</name>
>     <value>10000</value>
>     <description>
>       Maximum number of applications that can be pending and running.
>     </description>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
>     <value>0.1</value>
>     <description>
>       Maximum percent of resources in the cluster that can be used to run
>       application masters, i.e. it controls the number of concurrently
>       running applications.
>     </description>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.resource-calculator</name>
>     <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
>     <description>
>       The ResourceCalculator implementation used to compare Resources in
>       the scheduler. The default, DefaultResourceCalculator, only uses
>       memory, while DominantResourceCalculator uses the dominant resource
>       to compare multi-dimensional resources such as memory and CPU.
>     </description>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.queues</name>
>     <value>default</value>
>     <description>
>       The queues at this level (root is the root queue).
>     </description>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.default.capacity</name>
>     <value>100</value>
>     <description>Samza queue target capacity.</description>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
>     <value>1</value>
>     <description>
>       Default queue user limit, as a percentage from 0.0 to 1.0.
>     </description>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
>     <value>100</value>
>     <description>
>       The maximum capacity of the default queue.
>     </description>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.default.state</name>
>     <value>RUNNING</value>
>     <description>
>       The state of the default queue. State can be one of RUNNING or STOPPED.
>     </description>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
>     <value>*</value>
>     <description>
>       The ACL of who can submit jobs to the default queue.
>     </description>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
>     <value>*</value>
>     <description>
>       The ACL of who can administer jobs on the default queue.
>     </description>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.node-locality-delay</name>
>     <value>40</value>
>     <description>
>       Number of missed scheduling opportunities after which the
>       CapacityScheduler attempts to schedule rack-local containers.
>       Typically this should be set to the number of nodes in the cluster.
>       By default it is set to 40, approximately the number of nodes in
>       one rack.
>     </description>
>   </property>
> </configuration>
> {noformat}
>
> Digging into the code a bit (props to [~jghoman] for finding this), we have a theory as to why this is happening. It looks like RMContainerRequestor.addContainerReq adds three resource requests per container request: data-local, rack-local, and any:
>
> {code}
> protected void addContainerReq(ContainerRequest req) {
>   // Create resource requests
>   for (String host : req.hosts) {
>     // Data-local
>     if (!isNodeBlacklisted(host)) {
>       addResourceRequest(req.priority, host, req.capability);
>     }
>   }
>
>   // Nothing Rack-local for now
>   for (String rack : req.racks) {
>     addResourceRequest(req.priority, rack, req.capability);
>   }
>
>   // Off-switch
>   addResourceRequest(req.priority, ResourceRequest.ANY, req.capability);
> }
> {code}
>
> The addResourceRequest method, in turn, calls addResourceRequestToAsk, which in turn calls ask.add(remoteRequest):
>
> {code}
> private void addResourceRequestToAsk(ResourceRequest remoteRequest) {
>   // Because objects inside the resource map can be deleted, ask can end up
>   // containing an object that matches a new resource object but with a
>   // different numContainers. So existing values must be replaced explicitly.
>   if (ask.contains(remoteRequest)) {
>     ask.remove(remoteRequest);
>   }
>   ask.add(remoteRequest);
> }
> {code}
>
> The problem is that the "ask" variable is a TreeSet:
>
> {code}
> private final Set<ResourceRequest> ask = new TreeSet<ResourceRequest>(
>     new org.apache.hadoop.yarn.api.records.ResourceRequest.ResourceRequestComparator());
> {code}
>
> The ResourceRequestComparator sorts the TreeSet according to:
>
> {code}
> public int compare(ResourceRequest r1, ResourceRequest r2) {
>   // Compare priority, host and capability
>   int ret = r1.getPriority().compareTo(r2.getPriority());
>   if (ret == 0) {
>     String h1 = r1.getResourceName();
>     String h2 = r2.getResourceName();
>     ret = h1.compareTo(h2);
>   }
>   if (ret == 0) {
>     ret = r1.getCapability().compareTo(r2.getCapability());
>   }
>   return ret;
> }
> {code}
>
> The first thing to note is that our resource requests all have the same priority, so the TreeSet is effectively sorted by resource name (host/rack). The resource names added as part of addContainerReq are the host, the rack, and "any", which is denoted "*" (see above). The problem is that the TreeSet sorts the "*" request first, even if the host request was added first in addContainerReq.
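> This is easy to confirm against the real comparator. A minimal sketch using the Hadoop 2.4 record factories (the host and rack names are placeholders):
>
> {code}
> import java.util.TreeSet
> import org.apache.hadoop.yarn.api.records.{Priority, Resource, ResourceRequest}
>
> val priority = Priority.newInstance(0)
> val capability = Resource.newInstance(1024, 1)
> val ask = new TreeSet[ResourceRequest](
>   new ResourceRequest.ResourceRequestComparator())
>
> // One container request expands into three asks: host, rack, and ANY.
> ask.add(ResourceRequest.newInstance(priority, "eat1-app1234.example.com", capability, 1))
> ask.add(ResourceRequest.newInstance(priority, "/default-rack", capability, 1))
> ask.add(ResourceRequest.newInstance(priority, ResourceRequest.ANY, capability, 1))
>
> // Iteration order puts the ANY ("*") ask ahead of the rack and host asks.
> val it = ask.iterator()
> while (it.hasNext) println(it.next().getResourceName)
> // prints: *, /default-rack, eat1-app1234.example.com
> {code}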
> The underlying cause is plain lexicographic String ordering, as a quick REPL session shows:
>
> {code}
> > import java.util.TreeSet
>
> > val set = new TreeSet[String]
> set: java.util.TreeSet[String] = []
>
> > set.add("eat1-app")
>
> > set
> res3: java.util.TreeSet[String] = [eat1-app]
>
> > set.add("*")
>
> > set
> res5: java.util.TreeSet[String] = [*, eat1-app]
> {code}
>
> From here on out, it seems to me that anything interacting with the "ask" TreeSet (including the allocation requests) will be using the most general resource request, not the most specific.
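> The symptom can be reproduced end-to-end by asking for a single host-pinned container and checking where it lands. A sketch (the hostname is a placeholder, and in practice allocate() has to be polled over several heartbeats before containers come back):
>
> {code}
> import scala.collection.JavaConverters._
> import org.apache.hadoop.yarn.api.records.{Priority, Resource}
> import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
>
> // Assumes a started AMRMClient ("amClient") registered with the RM.
> val host = "eat1-app1234.example.com"
> amClient.addContainerRequest(
>   new ContainerRequest(Resource.newInstance(1024, 1),
>     Array(host), null, Priority.newInstance(0), false))
>
> val response = amClient.allocate(0.1f)
> response.getAllocatedContainers.asScala.foreach { container =>
>   // Given the TreeSet behavior above, this prints an arbitrary host
>   // rather than the one we asked for.
>   println("allocated on " + container.getNodeId.getHost)
> }
> {code}

--
This message was sent by Atlassian JIRA
(v6.2#6252)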