Neelesh Srinivas Salian commented on YARN-4185:

1) Using the exponentialBackoffRetry policy, the wait time starts at 1 sec and 
doubles on each retry, assuming it takes about a second for the NM to come up. 
The backoff time therefore grows as 1, 2, 4, 8, 16, ... up to 512 seconds as we 
approach 10 retries.

2) In the current strategy, the wait time is fixed at 10 seconds, so even when 
an NM restarts within 1 second, the client still waits the full 10 seconds 
before retrying.

3) As retries continue, by the 3rd retry the cumulative wait time is 7 seconds 
(1+2+4) with the exponential strategy versus 30 seconds (10+10+10) with the 
current static retry.

4) If you keep retrying, by the 6th retry attempt the static retry has waited 
60 seconds in total versus 2^6 - 1 = 63 seconds (1+2+4+8+16+32) with the 
exponential strategy.
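
To make the comparison in points 1) to 4) easy to check, here is a minimal 
sketch that tabulates the cumulative wait for both strategies. The function 
names and the 1 sec base are assumptions for illustration only; this is not 
the NM client code.

```python
def exponential_waits(retries, base=1):
    """Wait (seconds) before each retry: base, 2*base, 4*base, ..."""
    return [base * 2 ** i for i in range(retries)]


def static_waits(retries, interval=10):
    """Wait (seconds) before each retry with a fixed interval."""
    return [interval] * retries


def cumulative(waits):
    """Running total of the waits, one entry per retry."""
    totals = []
    total = 0
    for wait in waits:
        total += wait
        totals.append(total)
    return totals


exp_totals = cumulative(exponential_waits(10))
static_totals = cumulative(static_waits(10))

for attempt, (exp, static) in enumerate(zip(exp_totals, static_totals), start=1):
    print(f"retry {attempt:2d}: exponential {exp:4d}s  static {static:3d}s")
```

After the 3rd retry the totals are 7 vs 30 seconds; after the 6th they are 
63 vs 60 seconds, which is where the uncapped exponential strategy starts to 
wait longer than the static one.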

Logic for the Design:
1) With the number of retries defaulting to 10:
   a. I propose that after the 3rd attempt, we hold the wait time at 4 
seconds for all remaining retries. 
   Thus the total wait time comes to 1+2+4+4+4+4+4+4+4+4 = 35 seconds.
   b. Versus collectively spending 100 seconds on waiting time in the static 
retry strategy.

2) Alternatively, the logic could be:
   a. Use the exponential backoff for the first 3 retry attempts. If more 
retries are needed, restart the same sequence from 1 sec.
      So it looks like this: (1,2,4) (1,2,4) (1,2,4) (1) for 10 retries.
   b. Thus the 10 retries complete in 22 seconds of total waiting versus 100 
seconds with the static retry.
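
The two proposed schedules can be sketched as follows. This is only an 
illustration of the arithmetic under the assumptions stated above (1 sec 
base, 10 retries); the function names, the `cap` and `cycle_length` 
parameters are hypothetical and do not come from the YARN code base.

```python
def capped_waits(retries, base=1, cap=4):
    """Option 1: double the wait each retry, but never exceed `cap` seconds."""
    return [min(base * 2 ** i, cap) for i in range(retries)]


def cycling_waits(retries, base=1, cycle_length=3):
    """Option 2: double the wait each retry, restarting from `base`
    after every `cycle_length` retries."""
    return [base * 2 ** (i % cycle_length) for i in range(retries)]


def static_waits(retries, interval=10):
    """Current behaviour: the same fixed wait before every retry."""
    return [interval] * retries


schedules = {
    "capped": capped_waits(10),
    "cycling": cycling_waits(10),
    "static": static_waits(10),
}

for name, waits in schedules.items():
    print(f"{name:8s} {waits} total={sum(waits)}s")
```

With 10 retries this gives totals of 35 seconds (capped), 22 seconds 
(cycling) and 100 seconds (static).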

Requesting feedback.
Thank you.

> Retry interval delay for NM client can be improved from the fixed static 
> retry 
> -------------------------------------------------------------------------------
>                 Key: YARN-4185
>                 URL: https://issues.apache.org/jira/browse/YARN-4185
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Anubhav Dhoot
>            Assignee: Neelesh Srinivas Salian
> Instead of having a fixed retry interval that starts off very high and stays 
> there, we are better off using an exponential backoff that has the same fixed 
> max limit. Today the retry interval is fixed at 10 sec, which can be 
> unnecessarily high, especially when NMs can complete a rolling restart within 
> a sec.
