Karthik Kambatla created YARN-2063:

             Summary: ZKRMStateStore: Better handling of operation failures
                 Key: YARN-2063
                 URL: https://issues.apache.org/jira/browse/YARN-2063
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.4.0
            Reporter: Karthik Kambatla
            Assignee: Karthik Kambatla
            Priority: Critical

Today, when a ZK operation fails, we handle connection-loss and
operation-timeout the same way. The handling could be improved in a few ways:
# Add special handling for other error codes
# Connection-loss: Nullify zkClient, so a new connection is established
# Operation-timeout: Retry a few times with exponential delay?
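
A rough sketch of how the three items above could fit together. It is illustrative only and is not the ZKRMStateStore code: the Failure enum, ZkOpException, ClientResetter, and runWithRetries are all hypothetical names, and a real patch would inspect the error code of the ZooKeeper exception instead. Connection loss resets the client and retries immediately, operation timeout retries after an exponentially growing delay, and any other failure is rethrown to the caller.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch of per-error-code handling; not the actual ZKRMStateStore code. */
final class ZkRetrySketch {

  /** Illustrative failure kinds; real code would look at the ZooKeeper error code. */
  enum Failure { CONNECTION_LOSS, OPERATION_TIMEOUT, OTHER }

  /** Hypothetical exception that carries the failure kind. */
  static final class ZkOpException extends Exception {
    final Failure kind;
    ZkOpException(Failure kind) {
      super(kind.name());
      this.kind = kind;
    }
  }

  /** Hypothetical hook that drops the current client so the next attempt reconnects. */
  interface ClientResetter {
    void resetClient();
  }

  /** Delay before the retry that follows attempt number `attempt` (0-based): base * 2^attempt. */
  static long backoffMillis(long baseMillis, int attempt) {
    // Assumes a small attempt count, so the shift cannot overflow.
    return baseMillis << attempt;
  }

  /** Runs op, retrying up to maxRetries times depending on the failure kind. */
  static <T> T runWithRetries(Callable<T> op, ClientResetter resetter,
                              int maxRetries, long baseDelayMillis) throws Exception {
    for (int attempt = 0; ; attempt++) {
      try {
        return op.call();
      } catch (ZkOpException e) {
        if (attempt >= maxRetries) {
          throw e; // retries exhausted
        }
        switch (e.kind) {
          case CONNECTION_LOSS:
            // Nullify the client so a new connection is established, then retry.
            resetter.resetClient();
            break;
          case OPERATION_TIMEOUT:
            // Retry after an exponentially increasing delay.
            Thread.sleep(backoffMillis(baseDelayMillis, attempt));
            break;
          default:
            // Other error codes: no retry in this sketch, surface to the caller.
            throw e;
        }
      }
    }
  }
}
```

Whether connection loss should also back off, and what the retry cap and base delay should be, are open choices that this sketch does not settle.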

This message was sent by Atlassian JIRA
