[jira] [Updated] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: (was: YARN-2673-101914.patch) > Add retry for timeline client put APIs > -- > > Key: YARN-2673 > URL: https://issues.apache.org/jira/browse/YARN-2673 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, > YARN-2673-101414.patch, YARN-2673-101714.patch, YARN-2673-102014.patch > > > Timeline client now does not handle the case gracefully when the server is > down. Jobs from distributed shell may fail due to ATS restart. We may need to > add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-102014.patch > Add retry for timeline client put APIs > -- > > Key: YARN-2673 > URL: https://issues.apache.org/jira/browse/YARN-2673 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, > YARN-2673-101414.patch, YARN-2673-101714.patch, YARN-2673-102014.patch > > > Timeline client now does not handle the case gracefully when the server is > down. Jobs from distributed shell may fail due to ATS restart. We may need to > add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-101914.patch Hi [~zjshen], thanks for your review! I addressed your comments, and rebased the patch with the latest trunk. If you have time please feel free to take a look. Thanks! > Add retry for timeline client put APIs > -- > > Key: YARN-2673 > URL: https://issues.apache.org/jira/browse/YARN-2673 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, > YARN-2673-101414.patch, YARN-2673-101714.patch, YARN-2673-101914.patch > > > Timeline client now does not handle the case gracefully when the server is > down. Jobs from distributed shell may fail due to ATS restart. We may need to > add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-101714.patch For some unknown reason, Jenkins executed the wrong set of unit tests. Kicking it again to see if the problem is temporary. > Add retry for timeline client put APIs > -- > > Key: YARN-2673 > URL: https://issues.apache.org/jira/browse/YARN-2673 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, > YARN-2673-101414.patch, YARN-2673-101714.patch > > > Timeline client now does not handle the case gracefully when the server is > down. Jobs from distributed shell may fail due to ATS restart. We may need to > add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: (was: YARN-2673-101714.patch) > Add retry for timeline client put APIs > -- > > Key: YARN-2673 > URL: https://issues.apache.org/jira/browse/YARN-2673 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, > YARN-2673-101414.patch > > > Timeline client now does not handle the case gracefully when the server is > down. Jobs from distributed shell may fail due to ATS restart. We may need to > add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-101714.patch

Hi [~zjshen], I've updated my patch according to your comments. I've also fixed a bug in the previous version: I had confused "maxRetries" with "maxTries", so the retry filter issued one attempt fewer than intended.

Responses to your comments:
1. Made retried, maxRetries and retryInterval @VisibleForTesting.
bq. After retried is set to true first time. It is always true, which means it's not useful for asserting the second request.
This is a bug. retried should indicate whether a retry happened in the last Jersey request. I've fixed this in this patch by resetting retried every time a request is launched (that is, every time the client filter is called).
2. Fixed.
3. maxRetries can be -1 to indicate that there is no limit on the number of retries (described in TimelineJerseyRetryFilter). I've added a comment line here to make this clearer (and also a line in the original configuration).
4. Fixed.
5. I think you raised a very valid point. I've removed this new API.

> Add retry for timeline client put APIs > -- > > Key: YARN-2673 > URL: https://issues.apache.org/jira/browse/YARN-2673 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, > YARN-2673-101414.patch, YARN-2673-101714.patch > > > Timeline client now does not handle the case gracefully when the server is > down. Jobs from distributed shell may fail due to ATS restart. We may need to > add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
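The retry semantics discussed in the comment above (maxRetries counts retries rather than total tries, -1 means no limit, and retried is reset on every request) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: the class name RetryFilterSketch, the handle and wasRetried methods, and the use of java.util.concurrent.Callable in place of a Jersey client filter are all hypothetical.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical stand-in for a retry filter; it does not use the Jersey API.
public class RetryFilterSketch {
    private final int maxRetries;       // number of retries after the first try; -1 means unlimited
    private final long retryIntervalMs; // pause between consecutive attempts
    private boolean retried;            // whether the most recent request needed a retry

    public RetryFilterSketch(int maxRetries, long retryIntervalMs) {
        this.maxRetries = maxRetries;
        this.retryIntervalMs = retryIntervalMs;
    }

    public <T> T handle(Callable<T> request) throws Exception {
        retried = false; // reset for every request so the flag describes only this one
        int retriesDone = 0;
        while (true) {
            try {
                return request.call();
            } catch (IOException e) {
                // With maxRetries = N there are N + 1 attempts in total.
                if (maxRetries >= 0 && retriesDone >= maxRetries) {
                    throw e;
                }
                retriesDone++;
                retried = true;
                Thread.sleep(retryIntervalMs);
            }
        }
    }

    public boolean wasRetried() {
        return retried;
    }

    public static void main(String[] args) throws Exception {
        RetryFilterSketch filter = new RetryFilterSketch(2, 10L);
        final int[] calls = {0};
        // Fails twice, then succeeds on the third attempt.
        String result = filter.handle(() -> {
            if (++calls[0] < 3) {
                throw new IOException("server down");
            }
            return "ok";
        });
        System.out.println(result + " after " + calls[0] + " attempts, retried=" + filter.wasRetried());
        // A second request that succeeds immediately resets the flag.
        filter.handle(() -> "ok");
        System.out.println("retried=" + filter.wasRetried());
    }
}
```

Counting retries separately from the first attempt is what distinguishes "maxRetries" from "maxTries": comparing the total attempt count against maxRetries would stop one attempt early.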
[jira] [Updated] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2673: -- Summary: Add retry for timeline client put APIs (was: Add retry for timeline client) > Add retry for timeline client put APIs > -- > > Key: YARN-2673 > URL: https://issues.apache.org/jira/browse/YARN-2673 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, > YARN-2673-101414.patch > > > Timeline client now does not handle the case gracefully when the server is > down. Jobs from distributed shell may fail due to ATS restart. We may need to > add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-101414-2.patch > Add retry for timeline client > - > > Key: YARN-2673 > URL: https://issues.apache.org/jira/browse/YARN-2673 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, > YARN-2673-101414.patch > > > Timeline client now does not handle the case gracefully when the server is > down. Jobs from distributed shell may fail due to ATS restart. We may need to > add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: (was: YARN-2673-101414-2.patch) > Add retry for timeline client > - > > Key: YARN-2673 > URL: https://issues.apache.org/jira/browse/YARN-2673 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2673-101414-1.patch, YARN-2673-101414.patch > > > Timeline client now does not handle the case gracefully when the server is > down. Jobs from distributed shell may fail due to ATS restart. We may need to > add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-101414-2.patch Debugging the UT failure. > Add retry for timeline client > - > > Key: YARN-2673 > URL: https://issues.apache.org/jira/browse/YARN-2673 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, > YARN-2673-101414.patch > > > Timeline client now does not handle the case gracefully when the server is > down. Jobs from distributed shell may fail due to ATS restart. We may need to > add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-101414-1.patch Addressed the findbugs comments and re-triggered the failing unit test. I could not reproduce the UT failure locally. > Add retry for timeline client > - > > Key: YARN-2673 > URL: https://issues.apache.org/jira/browse/YARN-2673 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2673-101414-1.patch, YARN-2673-101414.patch > > > Timeline client now does not handle the case gracefully when the server is > down. Jobs from distributed shell may fail due to ATS restart. We may need to > add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-101414.patch

Uploaded a patch for this issue. By default, TimelineClient will now retry for a given amount of time before throwing the exception when posting to the server. A few notes:
1. Retrying vs. discarding timeline data: Without this retry, the timeline client drops the posted data if the first attempt fails. I had an offline discussion with [~vinodkv]. We agreed that blocking the timeline client for a short while is better, since we may not want to drop critical timeline data.
2. Retry behavior configuration: Users can define the maximum retry count and the time interval between consecutive retries. We may want two levels of retry settings: a cluster-wide setting, managed by yarn-site.xml, and a per-application custom setting. For the cluster setting, I've added two configuration properties, yarn.timeline-service.client.max-retries (default 30) and yarn.timeline-service.client.retry-interval-ms (default 1000). I've also provided a customizeRetrySettings method for application-specific retry settings.
3. Retry implementation: The timeline client does not use RPC; it uses RESTful APIs. I'm implementing retry as a Jersey filter in this patch.
4. Tests: I added two new unit tests, one to test the customizeRetrySettings API and the other to test whether a retry actually happens when we try to post timeline entities.

> Add retry for timeline client > - > > Key: YARN-2673 > URL: https://issues.apache.org/jira/browse/YARN-2673 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2673-101414.patch > > > Timeline client now does not handle the case gracefully when the server is > down. Jobs from distributed shell may fail due to ATS restart. We may need to > add some retry mechanisms to the client.
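As a rough illustration of the two-level retry settings described in note 2 above, the sketch below resolves cluster-wide values from a plain map and lets an application override them. The property names and defaults are the ones quoted in the message; everything else is an assumption made for the example: the class name RetrySettingsSketch, the use of a Map in place of a Hadoop Configuration object, and the signature of customizeRetrySettings, which the message names but does not spell out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for retry settings; not the TimelineClient implementation.
public class RetrySettingsSketch {
    // Property names and defaults as quoted in the message above.
    static final String MAX_RETRIES_KEY = "yarn.timeline-service.client.max-retries";
    static final String RETRY_INTERVAL_MS_KEY = "yarn.timeline-service.client.retry-interval-ms";
    static final int DEFAULT_MAX_RETRIES = 30;
    static final long DEFAULT_RETRY_INTERVAL_MS = 1000L;

    private int maxRetries;
    private long retryIntervalMs;

    // Cluster level: read the values from the (simulated) site configuration.
    public RetrySettingsSketch(Map<String, String> clusterConf) {
        String retries = clusterConf.get(MAX_RETRIES_KEY);
        String interval = clusterConf.get(RETRY_INTERVAL_MS_KEY);
        this.maxRetries = retries == null ? DEFAULT_MAX_RETRIES : Integer.parseInt(retries);
        this.retryIntervalMs = interval == null ? DEFAULT_RETRY_INTERVAL_MS : Long.parseLong(interval);
    }

    // Application level: assumed signature for the per-application override.
    public void customizeRetrySettings(int maxRetries, long retryIntervalMs) {
        this.maxRetries = maxRetries;
        this.retryIntervalMs = retryIntervalMs;
    }

    public int getMaxRetries() {
        return maxRetries;
    }

    public long getRetryIntervalMs() {
        return retryIntervalMs;
    }

    public static void main(String[] args) {
        Map<String, String> clusterConf = new HashMap<>();
        clusterConf.put(MAX_RETRIES_KEY, "5"); // cluster admin lowers the retry count

        RetrySettingsSketch settings = new RetrySettingsSketch(clusterConf);
        System.out.println(settings.getMaxRetries() + " retries, "
            + settings.getRetryIntervalMs() + " ms apart");

        // One application opts for fewer, faster retries.
        settings.customizeRetrySettings(3, 200L);
        System.out.println(settings.getMaxRetries() + " retries, "
            + settings.getRetryIntervalMs() + " ms apart");
    }
}
```

With the quoted defaults, a client that exhausts all retries would block for roughly 30 seconds of sleep time plus the time spent in the failed attempts themselves.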
[jira] [Updated] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2673: -- Issue Type: Sub-task (was: Bug) Parent: YARN-1530 > Add retry for timeline client > - > > Key: YARN-2673 > URL: https://issues.apache.org/jira/browse/YARN-2673 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > > Timeline client now does not handle the case gracefully when the server is > down. Jobs from distributed shell may fail due to ATS restart. We may need to > add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)