[jira] [Updated] (YARN-514) Delayed store operations should not result in RM unavailability for app submission
[ https://issues.apache.org/jira/browse/YARN-514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

闫昆 updated YARN-514:
    Component/s: (was: resourcemanager)

Delayed store operations should not result in RM unavailability for app submission

    Key: YARN-514
    URL: https://issues.apache.org/jira/browse/YARN-514
    Project: Hadoop YARN
    Issue Type: Sub-task
    Reporter: Bikas Saha
    Assignee: Zhijie Shen
    Fix For: 2.1.0-beta
    Attachments: YARN-514.1.patch, YARN-514.2.patch, YARN-514.3.patch, YARN-514.4.patch, YARN-514.5.patch, YARN-514.6.patch, YARN-514.7.patch, YARN-514.8.patch

Currently, app submission is the only store operation performed synchronously because the app must be stored before the request returns with success. This makes the RM susceptible to blocking all client threads on slow store operations, resulting in RM being perceived as unavailable by clients.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-514) Delayed store operations should not result in RM unavailability for app submission
Vinod Kumar Vavilapalli updated YARN-514:
    Component/s: resourcemanager
[jira] [Updated] (YARN-514) Delayed store operations should not result in RM unavailability for app submission
Zhijie Shen updated YARN-514:
    Attachment: YARN-514.8.patch

In the newest patch, I use app directly. I checked the patch for the related MR JIRA; it can be applied and works together with the patch in this JIRA.
[jira] [Updated] (YARN-514) Delayed store operations should not result in RM unavailability for app submission
Zhijie Shen updated YARN-514:
    Attachment: YARN-514.6.patch

Thanks, @Bikas, for your investigation. I've modified the code. The newest patch contains the following major updates:

1. FAILED -> FAILED and KILLED -> KILLED transitions on RMAppEventType.APP_SAVED are defined. This fixes the problem pointed out by @Bikas.

2. In addition, I found a problem in the RMApp state transitions in the RM restart scenario. With the previous patch, the stored MRApp is recovered, an RMApp instance is created, and that instance transitions to NEW_SAVING and is stored again. To fix this, isRecovered is defined in RMAppImpl and is set to true when RMAppImpl#recover is called. Then, when RMAppEventType.START is received, the transition is NEW -> NEW_SAVING if the RMApp instance is not recovered, and NEW -> SUBMITTED otherwise.

3. Additional test cases are added to TestRMAppTransitions to test the aforementioned transition rules.

4. TestRMRestart should have caught the problem of storing a recovered RMApp instance again. However, the test case did not fail with the previous patch because MemoryRMStateStore did not throw an exception when storing a duplicate application/attempt. Therefore, in the newest patch, MemoryRMStateStore throws an IOException when the application/attempt has already been stored, which is consistent with the behavior of FileSystemRMStateStore. The current TestRMRestart test case can then catch the problem of saving the RMApp instance twice.
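The recovery rule and the duplicate check described in this comment can be sketched roughly as follows. This is a minimal illustration, not the code in the patch: DuplicateCheckingStore, RecoverableAppSketch, and every method on them are hypothetical names, and the kill() shortcut is a simplification made only for the example.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory store that rejects duplicates; NOT MemoryRMStateStore.
class DuplicateCheckingStore {
    private final Set<String> appIds = new HashSet<>();

    void storeApplication(String appId) throws IOException {
        // Set.add returns false when the element is already present.
        if (!appIds.add(appId)) {
            throw new IOException(appId + " has already been stored");
        }
    }
}

// Hypothetical app object; the isRecovered flag selects the transition on START.
class RecoverableAppSketch {
    enum State { NEW, NEW_SAVING, SUBMITTED, FAILED, KILLED }

    private final String appId;
    private final DuplicateCheckingStore store;
    private State state = State.NEW;
    private boolean isRecovered = false;

    RecoverableAppSketch(String appId, DuplicateCheckingStore store) {
        this.appId = appId;
        this.store = store;
    }

    void recover() { isRecovered = true; }

    // Simplification for the example: jump straight to KILLED.
    void kill() { state = State.KILLED; }

    void onStart() throws IOException {
        if (state != State.NEW) {
            throw new IllegalStateException("START is not valid in " + state);
        }
        if (isRecovered) {
            state = State.SUBMITTED;        // already persisted, skip the store
        } else {
            store.storeApplication(appId);  // first submission, persist it
            state = State.NEW_SAVING;
        }
    }

    void onAppSaved() {
        switch (state) {
            case NEW_SAVING: state = State.SUBMITTED; break;
            case FAILED:
            case KILLED: break;             // late notification, stay put
            default: throw new IllegalStateException("APP_SAVED is not valid in " + state);
        }
    }

    // Simulates an RM restart: the app is already in the store when a new
    // instance starts. Returns the final state name, or "DUPLICATE" if the
    // store rejected a second write.
    static String restartScenario(boolean callRecover) {
        DuplicateCheckingStore store = new DuplicateCheckingStore();
        try {
            store.storeApplication("app_1"); // written before the restart
            RecoverableAppSketch app = new RecoverableAppSketch("app_1", store);
            if (callRecover) {
                app.recover();
            }
            app.onStart();
            return app.state.name();
        } catch (IOException e) {
            return "DUPLICATE";
        }
    }

    // An app killed while its save is in flight ignores the late APP_SAVED.
    static String lateSaveAfterKill() {
        try {
            RecoverableAppSketch app =
                new RecoverableAppSketch("app_2", new DuplicateCheckingStore());
            app.onStart();
            app.kill();
            app.onAppSaved();
            return app.state.name();
        } catch (IOException e) {
            return "DUPLICATE";
        }
    }
}
```

In this sketch, omitting recover() after a restart makes the second write fail loudly, which is the kind of signal a restart test needs in order to notice the double save.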
[jira] [Updated] (YARN-514) Delayed store operations should not result in RM unavailability for app submission
Zhijie Shen updated YARN-514:
    Attachment: YARN-514.5.patch

I've drafted a newer patch in which YarnApplicationState, YarnApplicationStateProto, and RMAppState (which has one more state than the other two) list their states in a consistent order: NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED.
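One way to check this kind of ordering consistency is to verify that the shorter enum's names appear in the longer enum in the same relative order. The sketch below is only an illustration under stated assumptions: the enum and method names are hypothetical, and EXTRA_INTERNAL and its position are placeholders, because the comment does not say which state is the additional one.

```java
class StateOrderSketch {
    // Order taken from the comment; stands in for the client-visible enums.
    enum PublicState {
        NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED
    }

    // Same order plus one extra value. EXTRA_INTERNAL is a placeholder name
    // and its position is arbitrary.
    enum InternalState {
        NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, EXTRA_INTERNAL,
        FINISHED, FAILED, KILLED
    }

    // True if every name in 'shorter' occurs in 'longer' in the same relative order.
    static boolean sameRelativeOrder(Enum<?>[] shorter, Enum<?>[] longer) {
        int matched = 0;
        for (Enum<?> candidate : longer) {
            if (matched < shorter.length
                    && shorter[matched].name().equals(candidate.name())) {
                matched++;
            }
        }
        return matched == shorter.length;
    }

    public static void main(String[] args) {
        System.out.println(
            sameRelativeOrder(PublicState.values(), InternalState.values()));
    }
}
```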
[jira] [Updated] (YARN-514) Delayed store operations should not result in RM unavailability for app submission
Zhijie Shen updated YARN-514:
    Attachment: YARN-514.4.patch

Fixed the incorrect indentation.
[jira] [Updated] (YARN-514) Delayed store operations should not result in RM unavailability for app submission
Zhijie Shen updated YARN-514:
    Attachment: YARN-514.2.patch

Updated TestRMAppTransitions to avoid the bug in determining the intermediate SAVING state.
[jira] [Updated] (YARN-514) Delayed store operations should not result in RM unavailability for app submission
Zhijie Shen updated YARN-514:
    Attachment: YARN-514.3.patch

I've updated the patch. The major modifications are as follows:

1. SAVING is renamed to NEW_SAVING to be clearer.

2. On receiving RMAppEventType.START, RMApp transitions from NEW to NEW_SAVING, and RMAppSavingTransition is executed, which invokes storeApplication. On receiving RMAppEventType.APP_SAVED (sent from RMStateStore), RMApp transitions from NEW_SAVING to SUBMITTED, and StartAppAttemptTransition is executed, which checks for an application store exception before creating a new attempt. Therefore, the RMApp states from SUBMITTED onward are simply shifted one step later, with no other changes.

3. TestRMAppTransitions has been significantly simplified. Only the transition-related tests for the newly added state are included.

In addition, I've run a single-node cluster test and verified that the application store occurs before the attempt store.
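The event flow in item 2 can be pictured with a very small state machine. This is a sketch under assumptions, not RMAppImpl: the class and method names are hypothetical, the store call is reduced to a flag, and moving to FAILED when the save reports an exception is an assumption, since the comment only says the exception is checked before an attempt is created.

```java
// Hypothetical miniature of the app state machine; NOT RMAppImpl.
class AppStateSketch {
    enum State { NEW, NEW_SAVING, SUBMITTED, FAILED }
    enum Event { START, APP_SAVED }

    private State state = State.NEW;
    private boolean storeRequested = false;

    State getState() { return state; }
    boolean isStoreRequested() { return storeRequested; }

    // storeFailure is only consulted for APP_SAVED; null means the save succeeded.
    void handle(Event event, Exception storeFailure) {
        if (state == State.NEW && event == Event.START) {
            storeRequested = true;     // stands in for invoking storeApplication
            state = State.NEW_SAVING;
        } else if (state == State.NEW_SAVING && event == Event.APP_SAVED) {
            // Assumption: a reported store failure moves the app to FAILED;
            // otherwise the app proceeds to SUBMITTED, where an attempt
            // would be created.
            state = (storeFailure == null) ? State.SUBMITTED : State.FAILED;
        } else {
            throw new IllegalStateException(event + " is not valid in " + state);
        }
    }

    public static void main(String[] args) {
        AppStateSketch app = new AppStateSketch();
        app.handle(Event.START, null);
        System.out.println(app.getState());
        app.handle(Event.APP_SAVED, null);
        System.out.println(app.getState());
    }
}
```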
[jira] [Updated] (YARN-514) Delayed store operations should not result in RM unavailability for app submission
Zhijie Shen updated YARN-514:
    Attachment: YARN-514.1.patch

In this patch, I've changed RMStateStore#storeApplication from a blocking API to a non-blocking one. Therefore, it is no longer necessary to invoke the API in ClientRMService#submitApplication. Instead, I defined a new RMApp state, named SAVING, between NEW and SUBMITTED. TestRMAppTransitions was modified to test the additional state transition and to test whether the application is stored before SUBMITTED and removed after FINISHED.

An additional issue is that the mapping between YARN and MapReduce states needs to be updated because of the newly added state. This will be filed and solved in a separate MR JIRA.
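The general idea of turning a blocking store call into a non-blocking one can be sketched with a worker thread and a completion callback. This is a minimal illustration, assuming a single-threaded executor and a callback interface; AsyncStoreSketch and its methods are hypothetical and do not reflect the actual RMStateStore API or how the patch dispatches events.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical stand-in for a state store; NOT the real RMStateStore API.
class AsyncStoreSketch {
    private final Map<String, String> stored = new ConcurrentHashMap<>();
    // A single worker thread performs the (possibly slow) writes in order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Non-blocking: queues the write and returns at once. The callback later
    // receives null on success or the failure cause otherwise.
    void storeApplication(String appId, String context, Consumer<Exception> onSaved) {
        worker.execute(() -> {
            Exception failure = null;
            try {
                Thread.sleep(50); // simulate a slow store operation
                stored.put(appId, context);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                failure = e;
            }
            onSaved.accept(failure);
        });
    }

    boolean isStored(String appId) { return stored.containsKey(appId); }

    void shutdown() { worker.shutdown(); }

    // Submits one app, then waits for the save notification.
    // Returns true if the callback reported success and the app was stored.
    static boolean submitAndAwaitSave(String appId) {
        AsyncStoreSketch store = new AsyncStoreSketch();
        CountDownLatch saved = new CountDownLatch(1);
        boolean[] ok = {false};
        store.storeApplication(appId, "submission-context", failure -> {
            ok[0] = (failure == null);
            saved.countDown();
        });
        // At this point a client-facing call could already have returned.
        try {
            return saved.await(5, TimeUnit.SECONDS) && ok[0] && store.isStored(appId);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            store.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(submitAndAwaitSave("app_1"));
    }
}
```

In this sketch the callback plays the role of a "saved" notification: the caller's thread is never held while the write is in progress, and whatever reacts to the callback decides when the app may move past its saving state.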