MENG DING commented on YARN-1902:

I was almost going to log the same issue when I saw this thread (and also 
YARN-3020) :-).

After reading all the discussions, and after reading the related code, I still 
believe this is a bug.

I understand [~bikassaha]'s point that the AM-RM protocol is NOT a delta 
protocol, and that the user (i.e., the ApplicationMaster) is currently 
responsible for calling removeContainerRequest() after receiving an 
allocation, but consider the following simple modification to the packaged 
*distributedshell* application:

@@ -805,6 +805,8 @@ public void onContainersAllocated(List<Container> allocatedContainers) {
         // as all containers may not be allocated at one go.
+        ContainerRequest containerAsk = setupContainerAskForRM();
+        amRMClient.removeContainerRequest(containerAsk);

The code simply removes a container request after successfully receiving an 
allocated container in the ApplicationMaster. When you submit this application 
by specifying, say, 3 containers on the CLI, you will sometimes get 4 
containers allocated (not counting the AM container)! 

root@node2:~# hadoop org.apache.hadoop.yarn.applications.distributedshell.Client -jar  -shell_command "sleep 100000" -num_containers 3 -timeout 200000000
root@node2:~# yarn container -list appattempt_1431531743796_0015_000001
15/05/15 20:49:01 INFO client.RMProxy: Connecting to ResourceManager at 
15/05/15 20:49:01 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
Total number of containers :5
                          Container-Id                      Start Time  Finish Time    State         Host  Node Http Address
container_1431531743796_0015_01_000005  Fri May 15 20:44:12 +0000 2015          N/A  RUNNING  node3:50093
container_1431531743796_0015_01_000001  Fri May 15 20:44:06 +0000 2015          N/A  RUNNING  node3:50093
container_1431531743796_0015_01_000002  Fri May 15 20:44:10 +0000 2015          N/A  RUNNING  node3:50093
container_1431531743796_0015_01_000004  Fri May 15 20:44:11 +0000 2015          N/A  RUNNING  node3:50093
container_1431531743796_0015_01_000003  Fri May 15 20:44:10 +0000 2015          N/A  RUNNING  node4:41128

The *fundamental* problem here, I believe, is that the AMRMClient maintains an 
internal request table, *remoteRequestsTable*, that keeps track of the *total* 
container requests (i.e., both the requests that have already been satisfied 
and those that are still outstanding):

  protected final
    Map<Priority, Map<String, TreeMap<Resource, ResourceRequestInfo>>>
      remoteRequestsTable =
      new TreeMap<Priority, Map<String, TreeMap<Resource, ResourceRequestInfo>>>();

However, the corresponding table *requests* at the scheduler side (inside 
AppSchedulingInfo.java) keeps track of *outstanding* container requests (i.e., 
container requests that are not yet satisfied):

  final Map<Priority, Map<String, ResourceRequest>> requests =
    new ConcurrentHashMap<Priority, Map<String, ResourceRequest>>();

Every time an allocation is successfully made, the decResourceRequest() or 
decrementOutstanding() call updates the *requests* table so that it contains 
only outstanding requests. Unfortunately, on every ApplicationMaster 
heartbeat, that same *requests* table is overwritten by the 
updateResourceRequests() call with the total requests coming from the 
AMRMClient.
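To make the mismatch concrete, here is a toy model (plain Java, NOT YARN code; the single string key and the method names only loosely mirror remoteRequestsTable, updateResourceRequests(), and decrementOutstanding()) of a client that keeps totals feeding a scheduler that keeps outstanding counts:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the bookkeeping mismatch (NOT YARN code): the client table
// holds *total* requests ever added, the scheduler table holds *outstanding*
// requests, and every heartbeat overwrites the scheduler's view with the
// client's totals.
public class RequestTableMismatch {
    // Stand-in for AMRMClientImpl's remoteRequestsTable (totals).
    public static final Map<String, Integer> clientTotals = new HashMap<>();
    // Stand-in for AppSchedulingInfo's requests table (outstanding).
    public static final Map<String, Integer> schedulerOutstanding = new HashMap<>();

    public static void addContainerRequest(String key) {
        clientTotals.merge(key, 1, Integer::sum);
    }

    // Models updateResourceRequests(): the scheduler's view is replaced
    // wholesale by whatever totals the client sends on the heartbeat.
    public static void heartbeat(String key) {
        schedulerOutstanding.put(key, clientTotals.getOrDefault(key, 0));
    }

    // Models the scheduler satisfying every request it currently sees,
    // decrementing the outstanding count for each grant.
    public static int allocateAll(String key) {
        int granted = schedulerOutstanding.getOrDefault(key, 0);
        schedulerOutstanding.put(key, 0);
        return granted;
    }

    public static void main(String[] args) {
        String key = "priority=0/*/1GB";
        for (int i = 0; i < 3; i++) {
            addContainerRequest(key);          // AM asks for 3 containers
        }
        heartbeat(key);
        int first = allocateAll(key);          // 3 containers granted
        // The client never shrank its totals, so the next heartbeat
        // re-asserts 3 outstanding requests although all were satisfied.
        heartbeat(key);
        int extra = allocateAll(key);
        System.out.println("first=" + first + ", extra=" + extra); // first=3, extra=3
    }
}
```

Calling removeContainerRequest() after each allocation does shrink the client's totals, but only after the fact; a heartbeat that races ahead of the removal can still carry stale totals, which is how the distributedshell run above ends up with 4 containers instead of 3.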

This inconsistency between the total requests tracked on the AMRMClient side 
and the outstanding requests tracked on the scheduler side is, in my opinion, 
very confusing to say the least.

I see that a solution has already been proposed by [~wangda] in YARN-3020, 
which I think is the correct thing to do:
{quote}
maybe we should add a default implementation to deduct pending resource 
requests by priority/resource-name/capacity of allocated containers 
automatically (User can disable this default behavior, implement their own 
logic to deduct pending resource requests.)
{quote}

This solution will make *remoteRequestsTable* in AMRMClient only keep track of 
outstanding container requests, which is then consistent with the *requests* 
table at the Scheduler side.
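As a sketch of that proposal (again a toy model under my own simplifications, not the actual YARN-3020 patch; in particular, a single string key stands in for the priority/resource-name/capability tuple), auto-deducting on allocation keeps the client-side table purely outstanding:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the YARN-3020 proposal (NOT the real patch): each allocated
// container automatically deducts the matching pending request, so the
// client-side table only ever holds *outstanding* requests.
public class AutoDeductSketch {
    // Requests keyed by a string standing in for
    // priority/resource-name/capability (a simplification).
    public static final Map<String, Integer> outstanding = new HashMap<>();

    public static void addContainerRequest(String key) {
        outstanding.merge(key, 1, Integer::sum);
    }

    // Invoked once per allocated container; removes the entry entirely
    // once its count drops to zero.
    public static void onContainerAllocated(String key) {
        outstanding.computeIfPresent(key, (k, v) -> v > 1 ? v - 1 : null);
    }

    public static int pending(String key) {
        return outstanding.getOrDefault(key, 0);
    }

    public static void main(String[] args) {
        String key = "priority=0/*/1GB";
        for (int i = 0; i < 3; i++) {
            addContainerRequest(key);       // ask for 3 containers
        }
        for (int i = 0; i < 3; i++) {
            onContainerAllocated(key);      // all 3 allocations arrive
        }
        // Nothing is left to be re-sent on the next heartbeat.
        System.out.println("pending=" + pending(key)); // pending=0
    }
}
```

Because the deduction happens inside the client before the next heartbeat is built, the heartbeat carries only genuinely outstanding requests, matching the scheduler's *requests* table.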

Any comments or thoughts? We are currently investigating YARN-1197, and are 
faced with a similar issue of properly tracking container resource increase 
requests on both the client and the server side.


> Allocation of too many containers when a second request is done with the same 
> resource capability
> -------------------------------------------------------------------------------------------------
>                 Key: YARN-1902
>                 URL: https://issues.apache.org/jira/browse/YARN-1902
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 2.2.0, 2.3.0, 2.4.0
>            Reporter: Sietse T. Au
>            Assignee: Sietse T. Au
>              Labels: client
>         Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called, and at least one of the z 
> allocated containers is started, then if another addContainerRequest call is 
> made and subsequently an allocate call to the RM, (z+1) containers will be 
> allocated, where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that (z+1) 
> containers are indeed requested in both scenarios, but that only in the 
> second scenario is the correct behavior observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map<Resource, 
> ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.

This message was sent by Atlassian JIRA
