Junping Du created YARN-4429:
--------------------------------

             Summary: RetryPolicies (other than 
FailoverOnNetworkExceptionRetry) should put on retry failed reason or the log 
from RMProxy's retry could be very misleading.
                 Key: YARN-4429
                 URL: https://issues.apache.org/jira/browse/YARN-4429
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.7.0, 2.6.0
            Reporter: Junping Du
            Assignee: Junping Du


In debugging a NM retry connection to RM (non-HA), the NM log during RM down 
time is very misleading:
{noformat}
2015-12-07 11:37:14,098 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:15,099 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 1 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:16,101 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 2 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:17,103 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 3 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:18,105 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 4 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:19,107 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 5 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:20,109 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 6 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:21,112 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 7 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:22,113 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 8 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:23,115 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 9 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:54,120 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:55,121 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 1 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:56,123 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 2 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:57,125 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 3 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:58,126 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 4 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:37:59,128 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 5 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-07 11:38:00,130 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: 0.0.0.0/0.0.0.0:8031. Already tried 6 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
{noformat}
It actually only log client side retry on NetworkConnection failure but not 
include any info on RetryInvocationHandler where the real retry policy works. 
From the code below in RetryInvocationHandler.java, even the retry ends, we 
don't put warn messages to include how much/many time/ counts we spent on retry 
logic that make it harder to debug.

{code}
        if (failAction != null) {
          if (failAction.reason != null) {
            LOG.warn("Exception while invoking " + currentProxy.proxy.getClass()
                + "." + method.getName() + " over " + currentProxy.proxyInfo
                + ". Not retrying because " + failAction.reason, ex);
          }
          throw ex;
        }
{code}
We should add failAction.reason as much as we can in multiple retry policies. 
In addition, we should keep consistent in log level for message during the 
retry attempts: now the ipc.client is INFO, but RetryInvocationHandler is DEBUG 
(if not fail_over). We should keep them consistent or it could be very 
confusing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to