Botong Huang created YARN-7899:
----------------------------------

             Summary: [AMRMProxy] Stateful FederationInterceptor for pending requests
                 Key: YARN-7899
                 URL: https://issues.apache.org/jira/browse/YARN-7899
             Project: Hadoop YARN
          Issue Type: Task
            Reporter: Botong Huang
            Assignee: Botong Huang


Today FederationInterceptor (FI, in AMRMProxy for YARN Federation) is stateless 
with respect to pending (outstanding) requests. Whenever the AM issues new 
requests, FI simply splits them, sends them to the sub-cluster YarnRMs, and 
forgets about them. This JIRA attempts to make FI stateful so that it remembers 
the pending requests in all relevant sub-clusters. This has two major benefits: 

1. It is a prerequisite for FI to be able to cancel a pending request in one 
sub-cluster and re-send it to other sub-clusters. This is needed for load 
balancing and to fully comply with the relax-locality semantics of falling 
back to ANY. When we send a request to one sub-cluster, we have effectively 
restricted the allocation for this request to that sub-cluster rather than 
everywhere. If this app's capacity in that sub-cluster is exhausted, or its 
YarnRM is overloaded and slow, the request will be stuck there for a long 
time even if there is free capacity in other sub-clusters. We need FI to 
remember and adjust the pending requests on the fly. 
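
As a rough illustration of the bookkeeping this implies, here is a minimal 
sketch in plain Java. It is not the FederationInterceptor code: the class 
PendingRequestBook, its methods, and the string request key are all 
hypothetical, and real requests would be YARN ResourceRequests rather than 
strings. It only shows that once FI tracks outstanding counts per sub-cluster, 
moving a stuck request becomes a local operation. 

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory model of per-sub-cluster pending requests. */
final class PendingRequestBook {
  // subClusterId -> (request key -> number of containers still outstanding)
  private final Map<String, Map<String, Integer>> pending = new HashMap<>();

  /** Remember that `containers` were asked from `subCluster` for this key. */
  void recordSent(String subCluster, String requestKey, int containers) {
    pending.computeIfAbsent(subCluster, k -> new HashMap<>())
        .merge(requestKey, containers, Integer::sum);
  }

  /** Reduce the outstanding count when the sub-cluster allocates containers. */
  void recordAllocated(String subCluster, String requestKey, int containers) {
    Map<String, Integer> perKey = pending.get(subCluster);
    if (perKey == null) {
      return;
    }
    // Returning null from the remapping function removes the entry.
    perKey.computeIfPresent(requestKey,
        (k, v) -> v - containers > 0 ? Integer.valueOf(v - containers) : null);
  }

  /** Containers still outstanding in `subCluster` for this key. */
  int outstanding(String subCluster, String requestKey) {
    return pending
        .getOrDefault(subCluster, Collections.<String, Integer>emptyMap())
        .getOrDefault(requestKey, 0);
  }

  /**
   * Cancel whatever is still outstanding in `from` and re-issue it to `to`.
   * Returns the number of containers moved (0 if nothing was pending).
   */
  int move(String from, String to, String requestKey) {
    Map<String, Integer> perKey = pending.get(from);
    Integer count = (perKey == null) ? null : perKey.remove(requestKey);
    if (count == null) {
      return 0;
    }
    recordSent(to, requestKey, count);
    return count;
  }
}
```

In this model, a caller that sees a request lingering in an overloaded 
sub-cluster would call move() and then send the matching cancel and new ask to 
the two YarnRMs. How FI decides when to move a request is outside this sketch. 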

2. It makes pending request recovery easier when a YarnRM fails over. Today, 
whenever one sub-cluster RM fails over, in order to recover the lost pending 
requests for that sub-cluster, we have to propagate the 
ApplicationMasterNotRegisteredException from the YarnRM back to the AM, 
triggering a full resend of pending requests from the AM. This resend contains 
the pending requests not only for the failing-over sub-cluster but for all 
sub-clusters. Since our split-merge (AMRMProxyPolicy) does not guarantee 
idempotency, the same request we sent to sub-cluster-1 earlier might be resent 
to sub-cluster-2. If neither of these YarnRMs has failed over, both will 
allocate for this request, leading to over-allocation. These full resends also 
put unnecessary load on every YarnRM in the cluster every time one YarnRM 
fails over. With a stateful FederationInterceptor, since we remember the 
pending requests we sent out earlier, we can shield the AM from the 
ApplicationMasterNotRegisteredException and resend the pending requests only 
to the failed-over YarnRM. This eliminates over-allocation and minimizes the 
recovery overhead. 
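
The targeted resend can be sketched as follows. This is again a simplified 
model under assumptions, not the actual implementation: FailoverRecoverySketch 
and its methods are hypothetical names, requests are plain strings, and the 
re-registration with the failed-over YarnRM that would precede the resend is 
omitted. 

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model: replay pending requests only to the failed-over RM. */
final class FailoverRecoverySketch {
  // subClusterId -> requests sent there that are still pending
  private final Map<String, List<String>> pendingBySubCluster = new HashMap<>();

  /** Remember a request at the time it is sent to a sub-cluster. */
  void remember(String subCluster, String request) {
    pendingBySubCluster
        .computeIfAbsent(subCluster, k -> new ArrayList<>())
        .add(request);
  }

  /**
   * Requests to replay after the RM of `failedOverSubCluster` fails over.
   * Only that sub-cluster's entries are returned, in the order they were
   * sent; requests pending elsewhere are left alone, so no other YarnRM
   * sees a duplicate ask.
   */
  List<String> resendPlanFor(String failedOverSubCluster) {
    List<String> remembered = pendingBySubCluster.getOrDefault(
        failedOverSubCluster, Collections.<String>emptyList());
    return new ArrayList<>(remembered); // defensive copy
  }
}
```

Because the plan is built from what was actually sent to that sub-cluster, the 
split-merge policy is not re-run during recovery, which is what avoids the 
non-idempotent re-split described above. 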



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
