[jira] [Updated] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-102014.patch Add retry for timeline client put APIs -- Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch, YARN-2673-101714.patch, YARN-2673-102014.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: (was: YARN-2673-101914.patch) Add retry for timeline client put APIs -- Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch, YARN-2673-101714.patch, YARN-2673-102014.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-101914.patch Hi [~zjshen], thanks for your review! I addressed your comments, and rebased the patch with the latest trunk. If you have time please feel free to take a look. Thanks! Add retry for timeline client put APIs -- Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch, YARN-2673-101714.patch, YARN-2673-101914.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-101714.patch

Hi [~zjshen], I've updated my patch according to your comments. I've also fixed a bug in the previous version: in the previous patch I confused maxRetries with maxTries, so the retry filter issued one fewer attempt than intended. According to your comments:

1. Made retried, maxRetries and retryInterval @VisibleForTesting.
bq. After retried is set to true first time. It is always true, which means it's not useful for asserting the second request.
This is a bug. retried should indicate whether a retry happened in the last jersey request. I've fixed this issue in this patch by resetting retried every time a request is launched (and the client filter is called).
2. Fixed.
3. maxRetries can be -1 to indicate that there is no limit on the number of retries (described in TimelineJerseyRetryFilter). I've added a line of comment here to make it clearer (also a line in the original configuration).
4. Fixed.
5. I think you raised a very valid point. I've removed this new API.

Add retry for timeline client put APIs -- Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch, YARN-2673-101714.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
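The retry semantics described in this comment (maxRetries counts retries rather than total attempts, -1 means no limit, and retried is reset at the start of every request) can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in the patch: the class RetryFilterSketch, the Request interface, and the methods handle and wasRetried are hypothetical names, and no Jersey classes are used.

```java
import java.io.IOException;

/** Hypothetical stand-in for the retry filter discussed above; not the patch's code. */
final class RetryFilterSketch {

  /** Hypothetical callback standing in for forwarding one request to the server. */
  interface Request {
    void send() throws IOException;
  }

  private final int maxRetries;       // number of retries after the first attempt; -1 = no limit
  private final long retryIntervalMs; // pause between consecutive attempts
  private boolean retried;            // whether the most recent handle() call had to retry

  RetryFilterSketch(int maxRetries, long retryIntervalMs) {
    if (maxRetries < -1) {
      throw new IllegalArgumentException("maxRetries must be -1 or non-negative");
    }
    this.maxRetries = maxRetries;
    this.retryIntervalMs = retryIntervalMs;
  }

  boolean wasRetried() {
    return retried;
  }

  void handle(Request request) throws IOException, InterruptedException {
    retried = false;              // reset for every request, so the flag describes this call only
    int retriesLeft = maxRetries; // maxRetries = N allows N + 1 attempts in total
    while (true) {
      try {
        request.send();
        return;
      } catch (IOException e) {
        if (maxRetries != -1 && retriesLeft == 0) {
          throw e;                // retries exhausted: surface the last failure
        }
        if (retriesLeft > 0) {
          retriesLeft--;
        }
        retried = true;
        Thread.sleep(retryIntervalMs);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    RetryFilterSketch filter = new RetryFilterSketch(2, 10L);
    int[] attempts = {0};
    try {
      filter.handle(() -> {
        attempts[0]++;
        throw new IOException("timeline server unavailable");
      });
    } catch (IOException expected) {
      System.out.println("gave up after " + attempts[0] + " attempts");
    }
  }
}
```

Treating the first attempt separately from the retry budget is what avoids the off-by-one mentioned above: with maxRetries = 2 the request is sent at most three times.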
[jira] [Updated] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: (was: YARN-2673-101714.patch) Add retry for timeline client put APIs -- Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-101714.patch For some unknown reason, Jenkins executed the wrong set of unit tests. Kicking it again to see if the problem is temporary. Add retry for timeline client put APIs -- Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch, YARN-2673-101714.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2673: -- Summary: Add retry for timeline client put APIs (was: Add retry for timeline client) Add retry for timeline client put APIs -- Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-101414.patch

Uploaded a patch for this issue. By default, TimelineClient will retry for a given amount of time before throwing the exception when posting to the server. A few notes:

1. Retrying vs. discarding timeline data: Without this retry, the timeline client drops the posted data if the first attempt fails. Had an offline discussion with [~vinodkv]. We agreed that blocking the timeline client for a short while is better, since we may not want to drop critical timeline data.
2. Retry behavior configuration: Users can define the maximum retry count and the time interval between consecutive retries. We may want two levels of retry settings: a cluster-wide setting, managed by yarn-site.xml, and a per-application customized setting. For the cluster setting, I've added two configuration properties, yarn.timeline-service.client.max-retries (default 30) and yarn.timeline-service.client.retry-interval-ms (default 1000). I've also provided a customizeRetrySettings method for application-specific retry settings.
3. Retry implementation: The timeline client does not use RPC, but uses RESTful APIs. I'm implementing retry as a jersey filter in this patch.
4. Tests: I added two new unit tests, one to test the customizeRetrySettings API and the other to test whether the retry actually happens when we try to post timeline entities.

Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
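As a rough illustration of the behavior outlined in these notes (block and retry at a fixed interval instead of dropping the data after the first failed post), here is a minimal sketch in plain Java. It is not the patch itself: the class TimelinePutRetrySketch and the method putWithRetry are hypothetical, the put operation is modeled as a Callable rather than a Jersey request, and the two constants simply mirror the defaults quoted above (30 retries, 1000 ms).

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical sketch of a blocking retry around a put call; not the YARN-2673 patch. */
final class TimelinePutRetrySketch {

  // Values mirroring the defaults quoted for the two new configuration properties.
  static final int DEFAULT_MAX_RETRIES = 30;
  static final long DEFAULT_RETRY_INTERVAL_MS = 1000L;

  /**
   * Runs the put operation, retrying on IOException up to maxRetries times and
   * sleeping retryIntervalMs between attempts. This sketch assumes maxRetries >= 0.
   */
  static <T> T putWithRetry(Callable<T> put, int maxRetries, long retryIntervalMs)
      throws Exception {
    int retriesDone = 0;
    while (true) {
      try {
        return put.call();
      } catch (IOException e) {
        if (retriesDone >= maxRetries) {
          throw e; // give up and surface the failure to the caller
        }
        retriesDone++;
        Thread.sleep(retryIntervalMs); // block the caller instead of dropping the data
      }
    }
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // Simulated server that is unavailable for the first two attempts.
    String result = putWithRetry(() -> {
      if (++calls[0] < 3) {
        throw new IOException("timeline server unavailable");
      }
      return "posted";
    }, DEFAULT_MAX_RETRIES, 10L);
    System.out.println(result + " after " + calls[0] + " attempts");
  }
}
```

With the quoted defaults, a caller could block for roughly 30 seconds before the exception is thrown, which is the trade-off note 1 accepts in exchange for not losing timeline data.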
[jira] [Updated] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-101414-1.patch Addressed the comments from findbugs, and retrying after the unit test failure. I could not reproduce the UT failure locally. Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-101414-2.patch Debugging the UT failure. Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: (was: YARN-2673-101414-2.patch) Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-101414-2.patch Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2673: -- Issue Type: Sub-task (was: Bug) Parent: YARN-1530 Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)