[jira] [Commented] (YARN-618) Modify RM_INVALID_IDENTIFIER to a -ve number
[ https://issues.apache.org/jira/browse/YARN-618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647425#comment-13647425 ] Hudson commented on YARN-618: - Integrated in Hadoop-Yarn-trunk #201 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/201/]) YARN-618. Modified RM_INVALID_IDENTIFIER to be -1 instead of zero. Contributed by Jian He. (Revision 1478230) Result = SUCCESS vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1478230 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/ResourceManagerConstants.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestEventFlow.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/BaseContainerManagerTest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/TestContainersMonitor.java Modify RM_INVALID_IDENTIFIER to a -ve number - Key: YARN-618 URL: https://issues.apache.org/jira/browse/YARN-618 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.0.5-beta Attachments: YARN-618.1.patch, YARN-618.2.patch, YARN-618.3-branch-2.patch, YARN-618.3.patch, YARN-618.patch RM_INVALID_IDENTIFIER set to 0 doesn't sound right as many tests set it to 0. Probably a -ve number is what we want. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-618) Modify RM_INVALID_IDENTIFIER to a -ve number
[ https://issues.apache.org/jira/browse/YARN-618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647489#comment-13647489 ] Hudson commented on YARN-618: - Integrated in Hadoop-Hdfs-trunk #1390 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1390/]) YARN-618. Modified RM_INVALID_IDENTIFIER to be -1 instead of zero. Contributed by Jian He. (Revision 1478230) Result = FAILURE vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1478230 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/ResourceManagerConstants.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestEventFlow.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/BaseContainerManagerTest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/TestContainersMonitor.java Modify RM_INVALID_IDENTIFIER to a -ve number - Key: YARN-618 URL: https://issues.apache.org/jira/browse/YARN-618 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.0.5-beta Attachments: YARN-618.1.patch, YARN-618.2.patch, YARN-618.3-branch-2.patch, YARN-618.3.patch, YARN-618.patch RM_INVALID_IDENTIFIER set to 0 doesn't sound right as many tests set it to 0. Probably a -ve number is what we want. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-618) Modify RM_INVALID_IDENTIFIER to a -ve number
[ https://issues.apache.org/jira/browse/YARN-618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647544#comment-13647544 ] Hudson commented on YARN-618: - Integrated in Hadoop-Mapreduce-trunk #1417 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1417/]) YARN-618. Modified RM_INVALID_IDENTIFIER to be -1 instead of zero. Contributed by Jian He. (Revision 1478230) Result = SUCCESS vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1478230 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/ResourceManagerConstants.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestEventFlow.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/BaseContainerManagerTest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/TestContainersMonitor.java Modify RM_INVALID_IDENTIFIER to a -ve number - Key: YARN-618 URL: https://issues.apache.org/jira/browse/YARN-618 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.0.5-beta Attachments: YARN-618.1.patch, YARN-618.2.patch, YARN-618.3-branch-2.patch, YARN-618.3.patch, YARN-618.patch RM_INVALID_IDENTIFIER set to 0 doesn't sound right as many tests set it to 0. Probably a -ve number is what we want. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
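[Editor's note] The committed change itself is tiny; below is a minimal sketch of what the constant looks like after this fix. Only the -1 value and the ResourceManagerConstants file name are confirmed by the commit message above; the surrounding declaration is an assumption.
{code}
// Sketch only: the -1 value and the class name come from the commit message;
// the rest of the declaration is assumed for illustration.
package org.apache.hadoop.yarn.server.api;

public interface ResourceManagerConstants {

  // Previously 0, which many tests also used as a valid identifier; -1 can
  // never collide with a real RM identifier.
  long RM_INVALID_IDENTIFIER = -1;
}
{code}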
[jira] [Commented] (YARN-621) RM triggers web auth failure before first job
[ https://issues.apache.org/jira/browse/YARN-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647687#comment-13647687 ] Allen Wittenauer commented on YARN-621: --- Some more details: * With the exception of fixing the broken start-dfs.sh, this is a pure Apache 2.0.4 deploy on RHEL 6.3. * We configure logical NICs and bind services to them. In order to work around HADOOP-9520, we have hard-coded all the service names in the configuration. * After the first job, authentication works and the system works as expected. If that job rolls off the page, the replay errors return. * Same realm or cross-realm does not appear to make a difference. * The only stack trace I'm able to find is the one generated by the filter for the replay error itself. * Hit this in both Firefox and Safari (which also triggers HADOOP-9521). RM triggers web auth failure before first job - Key: YARN-621 URL: https://issues.apache.org/jira/browse/YARN-621 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.4-alpha Reporter: Allen Wittenauer Assignee: Vinod Kumar Vavilapalli Priority: Critical On a secure YARN setup, before the first job is executed, going to the web interface of the resource manager triggers authentication errors. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-614) Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1
[ https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Riccomini updated YARN-614: - Attachment: YARN-614-1.patch Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1 -- Key: YARN-614 URL: https://issues.apache.org/jira/browse/YARN-614 Project: Hadoop YARN Issue Type: Improvement Reporter: Bikas Saha Attachments: YARN-614-0.patch, YARN-614-1.patch Attempts can fail due to a large number of user errors and they should not be retried unnecessarily. The only reason YARN should retry an attempt is when the hardware fails or YARN has an error. NM failing, lost NM and NM disk errors are the hardware errors that come to mind. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-614) Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1
[ https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647750#comment-13647750 ] Chris Riccomini commented on YARN-614: -- Added a new patch. Resolves 1 (switch justFinishedContainers to a map for O(1) container status lookup) and 3 (added a shouldIgnoreFailures method) in my list above. Bikas, I think we should leave recovery for another ticket. Do you want me to update RMAppManager.recover() to have the same if (app.attempts.size() - app.ignoredFailures >= app.maxAppAttempts) logic as RMAppImpl.AttemptFailedTransition? Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1 -- Key: YARN-614 URL: https://issues.apache.org/jira/browse/YARN-614 Project: Hadoop YARN Issue Type: Improvement Reporter: Bikas Saha Attachments: YARN-614-0.patch, YARN-614-1.patch Attempts can fail due to a large number of user errors and they should not be retried unnecessarily. The only reason YARN should retry an attempt is when the hardware fails or YARN has an error. NM failing, lost NM and NM disk errors are the hardware errors that come to mind. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
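[Editor's note] To make the discussed check concrete, here is a rough sketch of the failure-accounting logic, using hypothetical names (ignoredFailures, shouldIgnoreFailures, maxAppAttempts) taken from the discussion rather than from any committed patch:
{code}
// Hypothetical sketch of the attempt-limit logic discussed above; names and
// placement mirror the comments, not the committed code.
void onAttemptFailed(RMAppImpl app, ContainerStatus amContainerStatus) {
  // Failures outside the app's control (lost node, disk failure) are counted
  // separately so they do not consume one of the app's retries.
  if (shouldIgnoreFailures(amContainerStatus.getExitStatus())) {
    app.ignoredFailures++;
  }
  if (app.attempts.size() - app.ignoredFailures >= app.maxAppAttempts) {
    app.transitionTo(RMAppState.FAILED); // real failures exhausted the limit
  } else {
    app.createAndStartNewAttempt();      // retry the application
  }
}

boolean shouldIgnoreFailures(int exitStatus) {
  return exitStatus == ContainerExitStatus.ABORTED
      || exitStatus == ContainerExitStatus.DISKS_FAILED;
}
{code}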
[jira] [Commented] (YARN-636) Restore clientToken for app attempt after RM restart
[ https://issues.apache.org/jira/browse/YARN-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647769#comment-13647769 ] Bikas Saha commented on YARN-636: - If this is being fully covered in YARN-582 then please resolve this as a duplicate. Restore clientToken for app attempt after RM restart Key: YARN-636 URL: https://issues.apache.org/jira/browse/YARN-636 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-614) Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1
[ https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647770#comment-13647770 ] Chris Riccomini commented on YARN-614: -- Hey Bikas, Looking into the recovery stuff a bit more. As far as I can tell (still wrapping my head around this stuff), the RMApp's RECOVER transition moves from NEW to SUBMITTED right now. This transition is triggered by RMAppManager.recover -> RMAppManager.submitApplication, which sends the RECOVER event. The submitApplication call happens directly before appImpl.recover() in RMAppManager:
{code}
if (shouldRecover) {
  LOG.info("Recovering application " + appState.getAppId());
  submitApplication(appState.getApplicationSubmissionContext(),
      appState.getSubmitTime(), true);
  // re-populate attempt information in application
  RMAppImpl appImpl = (RMAppImpl) rmContext.getRMApps().get(
      appState.getAppId());
  appImpl.recover(state);
}
{code}
This means that the RECOVER transition (StartAppAttemptTransition) happens before we have any state in the RMAppImpl. As a result, we can't add any logic to StartAppAttemptTransition to determine whether we should transition to FAILED at this point (since the attempts variable will be empty at this point). I think this means that we can't do your second suggestion (Another solution could be to make the RMApp go from NEW to FAILED in the recover transition based on failed counts etc.). Am I understanding this correctly? Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1 -- Key: YARN-614 URL: https://issues.apache.org/jira/browse/YARN-614 Project: Hadoop YARN Issue Type: Improvement Reporter: Bikas Saha Attachments: YARN-614-0.patch, YARN-614-1.patch Attempts can fail due to a large number of user errors and they should not be retried unnecessarily. The only reason YARN should retry an attempt is when the hardware fails or YARN has an error. NM failing, lost NM and NM disk errors are the hardware errors that come to mind. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (YARN-641) Make AMLauncher in RM Use NMClient
[ https://issues.apache.org/jira/browse/YARN-641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth reassigned YARN-641: -- Assignee: Chris Nauroth (was: Zhijie Shen) Make AMLauncher in RM Use NMClient -- Key: YARN-641 URL: https://issues.apache.org/jira/browse/YARN-641 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Chris Nauroth YARN-422 adds NMClient. RM's AMLauncher is responsible for the interactions with an application's AM container. AMLauncher should also replace the raw ContainerManager proxy with NMClient. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (YARN-641) Make AMLauncher in RM Use NMClient
[ https://issues.apache.org/jira/browse/YARN-641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth reassigned YARN-641: -- Assignee: Zhijie Shen (was: Chris Nauroth) Accidentally assigned this to myself. Giving it back to Zhijie. Make AMLauncher in RM Use NMClient -- Key: YARN-641 URL: https://issues.apache.org/jira/browse/YARN-641 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen YARN-422 adds NMClient. RM's AMLauncher is responsible for the interactions with an application's AM container. AMLauncher should also replace the raw ContainerManager proxy with NMClient. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-638) Add RMDelegationTokens back to DelegationTokenSecretManager after RM Restart
[ https://issues.apache.org/jira/browse/YARN-638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647914#comment-13647914 ] Jian He commented on YARN-638: -- Simply adding RMDelegationTokens back to DelegationTokenSecretManager is not enough. We also need to store the master keys, since the renewToken method uses the corresponding key of a token to generate the new password and verify that the client is renewing the token with the correct password. The current solution for restoring RMDelegationTokens is to add a separate RMDelegationSecretManagerStore in RMStateStore. What it does is save the token and the master key whenever they are generated, and remove the states when a token expires or a key is rolled over. Add RMDelegationTokens back to DelegationTokenSecretManager after RM Restart Key: YARN-638 URL: https://issues.apache.org/jira/browse/YARN-638 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-638.1.patch This is missed in YARN-581. After RM restart, RMDelegationTokens need to be added both in DelegationTokenRenewer (addressed in YARN-581), and delegationTokenSecretManager -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
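[Editor's note] As a rough illustration of "save the token and the master key whenever they are generated, and remove the states on expiry/rollover", here is a sketch of the state-store hooks being proposed. All method names are hypothetical illustrations, not the committed API:
{code}
// Hypothetical sketch of the RMStateStore additions described above.
import org.apache.hadoop.security.token.delegation.DelegationKey;
import org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier;

public abstract class RMStateStoreSketch {

  // Called when a new RM delegation token is issued, so it survives restart.
  public abstract void storeRMDelegationToken(
      RMDelegationTokenIdentifier tokenId, Long renewDate);

  // Called when the secret manager rolls its master key; without the key,
  // renewToken() cannot regenerate and verify passwords after a restart.
  public abstract void storeRMDTMasterKey(DelegationKey masterKey);

  // Cleanup on token expiry and on master-key rollover.
  public abstract void removeRMDelegationToken(
      RMDelegationTokenIdentifier tokenId);

  public abstract void removeRMDTMasterKey(DelegationKey masterKey);
}
{code}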
[jira] [Updated] (YARN-638) Restore RMDelegationTokens after RM Restart
[ https://issues.apache.org/jira/browse/YARN-638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-638: - Summary: Restore RMDelegationTokens after RM Restart (was: Add RMDelegationTokens back to DelegationTokenSecretManager after RM Restart) Restore RMDelegationTokens after RM Restart --- Key: YARN-638 URL: https://issues.apache.org/jira/browse/YARN-638 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-638.1.patch This is missed in YARN-581. After RM restart, RMDelegationTokens need to be added both in DelegationTokenRenewer (addressed in YARN-581), and delegationTokenSecretManager -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-581) Test and verify that app delegation tokens are added to tokenNewer after RM restart
[ https://issues.apache.org/jira/browse/YARN-581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-581: - Summary: Test and verify that app delegation tokens are added to tokenNewer after RM restart (was: Test and verify that app delegation tokens are restored after RM restart) Test and verify that app delegation tokens are added to tokenNewer after RM restart --- Key: YARN-581 URL: https://issues.apache.org/jira/browse/YARN-581 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Jian He Fix For: 2.0.5-beta Attachments: YARN-581.1.patch, YARN-581.2.patch The code already saves the delegation tokens in AppSubmissionContext. Upon restart the AppSubmissionContext is used to submit the application again and so restores the delegation tokens. This jira tracks testing and verifying this functionality in a secure setup. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-581) Test and verify that app delegation tokens are added to tokenNewer after RM restart
[ https://issues.apache.org/jira/browse/YARN-581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647927#comment-13647927 ] Jian He commented on YARN-581: -- Changed the title; this patch handles restoring delegationTokens in tokenNewer. Restoring delegationTokens in delegationTokenSecretManager is addressed in YARN-638. Test and verify that app delegation tokens are added to tokenNewer after RM restart --- Key: YARN-581 URL: https://issues.apache.org/jira/browse/YARN-581 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Jian He Fix For: 2.0.5-beta Attachments: YARN-581.1.patch, YARN-581.2.patch The code already saves the delegation tokens in AppSubmissionContext. Upon restart the AppSubmissionContext is used to submit the application again and so restores the delegation tokens. This jira tracks testing and verifying this functionality in a secure setup. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-636) Restore clientToken for app attempt after RM restart
[ https://issues.apache.org/jira/browse/YARN-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647941#comment-13647941 ] Jian He commented on YARN-636: -- Currently, this is not covered in YARN-582; I separated the patch. Restore clientToken for app attempt after RM restart Key: YARN-636 URL: https://issues.apache.org/jira/browse/YARN-636 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-581) Test and verify that app delegation tokens are added to tokenRenewer after RM restart
[ https://issues.apache.org/jira/browse/YARN-581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-581: - Summary: Test and verify that app delegation tokens are added to tokenRenewer after RM restart (was: Test and verify that app delegation tokens are added to tokenNewer after RM restart) Test and verify that app delegation tokens are added to tokenRenewer after RM restart - Key: YARN-581 URL: https://issues.apache.org/jira/browse/YARN-581 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Jian He Fix For: 2.0.5-beta Attachments: YARN-581.1.patch, YARN-581.2.patch The code already saves the delegation tokens in AppSubmissionContext. Upon restart the AppSubmissionContext is used to submit the application again and so restores the delegation tokens. This jira tracks testing and verifying this functionality in a secure setup. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-629) Make YarnRemoteException not be rooted at IOException
[ https://issues.apache.org/jira/browse/YARN-629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-629: --- Attachment: YARN-629.3.patch Make YarnRemoteException not be rooted at IOException - Key: YARN-629 URL: https://issues.apache.org/jira/browse/YARN-629 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-629.1.patch, YARN-629.2.patch, YARN-629.3.patch After HADOOP-9343, it should be possible for YarnException to not be rooted at IOException -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-629) Make YarnRemoteException not be rooted at IOException
[ https://issues.apache.org/jira/browse/YARN-629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648021#comment-13648021 ] Xuan Gong commented on YARN-629: Dropped the test code that verifies NMNotYetReadyException and InvalidContainerException; need to add them back after YARN-142. Make YarnRemoteException not be rooted at IOException - Key: YARN-629 URL: https://issues.apache.org/jira/browse/YARN-629 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-629.1.patch, YARN-629.2.patch, YARN-629.3.patch After HADOOP-9343, it should be possible for YarnException to not be rooted at IOException -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-629) Make YarnRemoteException not be rooted at IOException
[ https://issues.apache.org/jira/browse/YARN-629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648022#comment-13648022 ] Xuan Gong commented on YARN-629: Uploaded a new patch; all the MR changes are moved to MAPREDUCE-5204. Make YarnRemoteException not be rooted at IOException - Key: YARN-629 URL: https://issues.apache.org/jira/browse/YARN-629 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-629.1.patch, YARN-629.2.patch, YARN-629.3.patch After HADOOP-9343, it should be possible for YarnException to not be rooted at IOException -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-326) Add multi-resource scheduling to the fair scheduler
[ https://issues.apache.org/jira/browse/YARN-326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-326: Attachment: YARN-326-2.patch Add multi-resource scheduling to the fair scheduler --- Key: YARN-326 URL: https://issues.apache.org/jira/browse/YARN-326 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.0.2-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: FairSchedulerDRFDesignDoc-1.pdf, FairSchedulerDRFDesignDoc.pdf, YARN-326-1.patch, YARN-326-1.patch, YARN-326-2.patch, YARN-326.patch, YARN-326.patch With YARN-2 in, the capacity scheduler has the ability to schedule based on multiple resources, using dominant resource fairness. The fair scheduler should be able to do multiple resource scheduling as well, also using dominant resource fairness. More details to come on how the corner cases with fair scheduler configs such as min and max resources will be handled. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-326) Add multi-resource scheduling to the fair scheduler
[ https://issues.apache.org/jira/browse/YARN-326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-326: Attachment: (was: YARN-326-2.patch) Add multi-resource scheduling to the fair scheduler --- Key: YARN-326 URL: https://issues.apache.org/jira/browse/YARN-326 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.0.2-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: FairSchedulerDRFDesignDoc-1.pdf, FairSchedulerDRFDesignDoc.pdf, YARN-326-1.patch, YARN-326-1.patch, YARN-326-2.patch, YARN-326.patch, YARN-326.patch With YARN-2 in, the capacity scheduler has the ability to schedule based on multiple resources, using dominant resource fairness. The fair scheduler should be able to do multiple resource scheduling as well, also using dominant resource fairness. More details to come on how the corner cases with fair scheduler configs such as min and max resources will be handled. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-326) Add multi-resource scheduling to the fair scheduler
[ https://issues.apache.org/jira/browse/YARN-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648025#comment-13648025 ] Sandy Ryza commented on YARN-326: - Uploaded a patch that fixes a couple of bugs, includes more tests, and supports min and max share CPU configurations. If it passes Jenkins, it's ready for review. Add multi-resource scheduling to the fair scheduler --- Key: YARN-326 URL: https://issues.apache.org/jira/browse/YARN-326 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.0.2-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: FairSchedulerDRFDesignDoc-1.pdf, FairSchedulerDRFDesignDoc.pdf, YARN-326-1.patch, YARN-326-1.patch, YARN-326-2.patch, YARN-326.patch, YARN-326.patch With YARN-2 in, the capacity scheduler has the ability to schedule based on multiple resources, using dominant resource fairness. The fair scheduler should be able to do multiple resource scheduling as well, also using dominant resource fairness. More details to come on how the corner cases with fair scheduler configs such as min and max resources will be handled. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
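[Editor's note] For readers unfamiliar with dominant resource fairness: a schedulable's dominant share is its largest per-resource fraction of the cluster, and DRF favors the schedulable with the smallest dominant share. A minimal self-contained illustration, assuming just memory and vcores (the fair scheduler's real comparator also accounts for weights and min/max shares):
{code}
// Minimal DRF illustration; the numbers are made up for the example.
public final class DrfExample {

  /** Dominant share = max over resources of (usage / cluster capacity). */
  static double dominantShare(long usedMemMB, long usedCores,
                              long clusterMemMB, long clusterCores) {
    return Math.max((double) usedMemMB / clusterMemMB,
                    (double) usedCores / clusterCores);
  }

  public static void main(String[] args) {
    // 64 GB / 32-core cluster; app A is memory-heavy, app B is CPU-heavy.
    double a = dominantShare(8192, 2, 65536, 32); // max(0.125, 0.0625) = 0.125
    double b = dominantShare(1024, 8, 65536, 32); // max(~0.016, 0.25)  = 0.25
    // DRF offers the next container to the app with the smaller dominant
    // share -- here app A.
    System.out.printf("A=%.3f B=%.3f -> next: %s%n", a, b, a < b ? "A" : "B");
  }
}
{code}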
[jira] [Commented] (YARN-582) Restore appToken for app attempt after RM restart
[ https://issues.apache.org/jira/browse/YARN-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648037#comment-13648037 ] Bikas Saha commented on YARN-582: - Is the null check necessary? The underlying protobuf handles the null properly.
{code}
ByteBuffer appAttemptTokens = attemptState.getAppAttemptTokens();
if (appAttemptTokens != null) {
  attemptStateData.setAppAttemptTokens(appAttemptTokens);
}
{code}
New public method necessary? RMAppAttemptImpl.recoverAppAttemptTokens() Looks like all changes in RMAppImpl are unnecessary. Bug in the existing testDelegationTokenRestoredOnRMrestart(). The assert check should be made for rm1 and also for rm2. Right?
{code}
// start new RM
MockRM rm2 = new TestSecurityMockRM(conf, memStore);
rm2.start();

// verify tokens are properly populated back to DelegationTokenRenewer
Assert.assertEquals(tokenSet, rm1.getRMContext()
    .getDelegationTokenRenewer().getDelegationTokens());
{code}
Restore appToken for app attempt after RM restart - Key: YARN-582 URL: https://issues.apache.org/jira/browse/YARN-582 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Jian He Attachments: YARN-582.1.patch, YARN-582.2.patch, YARN-582.3.patch These need to be saved and restored on a per app attempt basis. This is required only when work preserving restart is implemented for secure clusters. In non-preserving restart app attempts are killed and so this does not matter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-582) Restore appToken for app attempt after RM restart
[ https://issues.apache.org/jira/browse/YARN-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648040#comment-13648040 ] Bikas Saha commented on YARN-582: - Also, we could be better off storing Credentials in RMAppAttemptImpl and only converting to ByteBuffer inside the RM store. Currently, because of this, in AMLauncher we end up converting the ByteBuffer back to Credentials. Restore appToken for app attempt after RM restart - Key: YARN-582 URL: https://issues.apache.org/jira/browse/YARN-582 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Jian He Attachments: YARN-582.1.patch, YARN-582.2.patch, YARN-582.3.patch These need to be saved and restored on a per app attempt basis. This is required only when work preserving restart is implemented for secure clusters. In non-preserving restart app attempts are killed and so this does not matter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-642) Fix up RMWebServices#getNodes
Sandy Ryza created YARN-642: --- Summary: Fix up RMWebServices#getNodes Key: YARN-642 URL: https://issues.apache.org/jira/browse/YARN-642 Project: Hadoop YARN Issue Type: Bug Components: api, resourcemanager Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza The code behind the /nodes RM REST API is unnecessarily muddled, logs the same misspelled INFO message repeatedly, and does not return unhealthy nodes, even when asked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-643) WHY appToken is removed both in BaseFinalTransition and AMUnregisteredTransition AND clientToken is removed in FinalTransition and not BaseFinalTransition
Jian He created YARN-643: Summary: WHY appToken is removed both in BaseFinalTransition and AMUnregisteredTransition AND clientToken is removed in FinalTransition and not BaseFinalTransition Key: YARN-643 URL: https://issues.apache.org/jira/browse/YARN-643 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-513) Create common proxy client for communicating with RM
[ https://issues.apache.org/jira/browse/YARN-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648075#comment-13648075 ] Bikas Saha commented on YARN-513: - This will not work since for different protocols we have different ports on the RM; rmAddress cannot be passed into the method. Also, for the failover case, the rmAddress needs to be determined internally. Based on the protocol we need to find the correct address etc. from conf and create the correct proxy object.
{code}
+  public static <T> T createRMProxy(final Configuration conf,
+      final Class<T> protocol, final InetSocketAddress rmAddress)
{code}
Can this code be written in the form if (waitForEver) {} else {}? It may be simpler.
{code}
+    RetryPolicy retryPolicy =
+        (waitForEver) ? RetryPolicies.RETRY_FOREVER :
+        RetryPolicies.retryUpToMaximumTimeWithFixedSleep(rmConnectWaitMS,
+            rmConnectionRetryIntervalMS,
+            TimeUnit.MILLISECONDS);
+    Map<Class<? extends Exception>, RetryPolicy> exceptionToPolicyMap =
+        new HashMap<Class<? extends Exception>, RetryPolicy>();
+    exceptionToPolicyMap.put(java.net.ConnectException.class, retryPolicy);
+    exceptionToPolicyMap.put(java.io.EOFException.class, retryPolicy);
+    return (waitForEver) ? RetryPolicies.RETRY_FOREVER :
+        RetryPolicies.retryByException(
+            retryPolicy, exceptionToPolicyMap);
{code}
The same retryPolicy is being passed into the exception map and as the default value. What's the use of the exception map then?
{code}
+    RetryPolicies.retryByException(
+        retryPolicy, exceptionToPolicyMap);
{code}
Any way to keep diagnostic error messages? I think if we don't rename NMStatusUpdater.getRMClient to createRMProxy then we don't need LocalRMProxy and most of the test code changes will also disappear.
{code}
-  protected ResourceTracker getRMClient() {
-    Configuration conf = getConfig();
-    YarnRPC rpc = YarnRPC.create(conf);
-    return (ResourceTracker) rpc.getProxy(ResourceTracker.class, rmAddress,
-        conf);
+  @VisibleForTesting
+  protected ResourceTracker createRMProxy(Configuration conf)
+      throws IOException {
+    return RMProxy.createRMProxy(conf, ResourceTracker.class, rmAddress);
   }
{code}
Create common proxy client for communicating with RM Key: YARN-513 URL: https://issues.apache.org/jira/browse/YARN-513 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-513.1.patch, YARN-513.2.patch, YARN-513.3.patch, YARN-513.4.patch, YARN.513.5.patch, YARN-513.6.patch When the RM is restarting, the NM, AM and Clients should wait for some time for the RM to come back up. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
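[Editor's note] One possible shape for the if (waitForEver) {} else {} restructuring the review asks for, using Hadoop's org.apache.hadoop.io.retry API. This is a sketch of the suggestion, not the committed YARN-513 fix; note the default policy here differs from the per-exception policy, which is what makes the exception map meaningful:
{code}
// Sketch of the retry-policy construction suggested in the review above.
import java.io.EOFException;
import java.net.ConnectException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

public final class RmRetryPolicySketch {

  static RetryPolicy create(boolean waitForEver, long rmConnectWaitMS,
      long rmConnectionRetryIntervalMS) {
    if (waitForEver) {
      // RM restart: keep retrying until the RM comes back up.
      return RetryPolicies.RETRY_FOREVER;
    }
    // Bounded retries for connection-level failures only; everything else
    // falls through to the fail-fast default, so the exception map actually
    // changes behavior.
    RetryPolicy connectionRetry = RetryPolicies
        .retryUpToMaximumTimeWithFixedSleep(rmConnectWaitMS,
            rmConnectionRetryIntervalMS, TimeUnit.MILLISECONDS);
    Map<Class<? extends Exception>, RetryPolicy> exceptionToPolicyMap =
        new HashMap<Class<? extends Exception>, RetryPolicy>();
    exceptionToPolicyMap.put(ConnectException.class, connectionRetry);
    exceptionToPolicyMap.put(EOFException.class, connectionRetry);
    return RetryPolicies.retryByException(
        RetryPolicies.TRY_ONCE_THEN_FAIL, exceptionToPolicyMap);
  }
}
{code}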
[jira] [Updated] (YARN-642) Fix up RMWebServices#getNodes
[ https://issues.apache.org/jira/browse/YARN-642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-642: Labels: incompatible (was: ) Fix up RMWebServices#getNodes - Key: YARN-642 URL: https://issues.apache.org/jira/browse/YARN-642 Project: Hadoop YARN Issue Type: Bug Components: api, resourcemanager Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Labels: incompatible The code behind the /nodes RM REST API is unnecessarily muddled, logs the same misspelled INFO message repeatedly, and does not return unhealthy nodes, even when asked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-644) Basic null check is not performed on passed in arguments before using them in ContainerManagerImpl.startContainer
Omkar Vinit Joshi created YARN-644: -- Summary: Basic null check is not performed on passed in arguments before using them in ContainerManagerImpl.startContainer Key: YARN-644 URL: https://issues.apache.org/jira/browse/YARN-644 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Priority: Minor I see that validation/null checks are not performed on passed-in parameters, e.g. tokenId.getContainerID().getApplicationAttemptId() inside ContainerManagerImpl.authorizeRequest(). I guess we should add these checks. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
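[Editor's note] As a rough illustration of the guard being requested, a hedged sketch; the placement and error message are assumptions, and the real fix may validate differently:
{code}
// Hypothetical sketch of the missing validation in ContainerManagerImpl;
// RPCUtil.getRemoteException is used here only as a plausible error path.
private void authorizeRequest(ContainerTokenIdentifier tokenId)
    throws YarnRemoteException {
  if (tokenId == null || tokenId.getContainerID() == null
      || tokenId.getContainerID().getApplicationAttemptId() == null) {
    // Fail with a descriptive error instead of an opaque
    // NullPointerException deep inside startContainer().
    throw RPCUtil.getRemoteException(
        "Malformed request: missing container token / container id");
  }
  // ... existing authorization logic ...
}
{code}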
[jira] [Updated] (YARN-617) In unsercure mode, AM can fake resource requirements
[ https://issues.apache.org/jira/browse/YARN-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-617: --- Attachment: YARN-617.20130502.patch Updating existing test cases. In unsercure mode, AM can fake resource requirements - Key: YARN-617 URL: https://issues.apache.org/jira/browse/YARN-617 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Priority: Minor Attachments: YARN-617.20130501.1.patch, YARN-617.20130501.patch, YARN-617.20130502.patch Without security, it is impossible to completely avoid AMs faking resources. We can at the least make it as difficult as possible by using the same container tokens and the RM-NM shared key mechanism over unauthenticated RM-NM channel. In the minimum, this will avoid accidental bugs in AMs in unsecure mode. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-645) Move RMDelegationTokenSecretManager from yarn-server-common to yarn-server-resourcemanager
Jian He created YARN-645: Summary: Move RMDelegationTokenSecretManager from yarn-server-common to yarn-server-resourcemanager Key: YARN-645 URL: https://issues.apache.org/jira/browse/YARN-645 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He RMDelegationTokenSecretManager is specific to resource manager, should not belong to server-common -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-645) Move RMDelegationTokenSecretManager from yarn-server-common to yarn-server-resourcemanager
[ https://issues.apache.org/jira/browse/YARN-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-645: - Attachment: YARN-645.patch this patch moved RMDelegationTokenSecretManager Move RMDelegationTokenSecretManager from yarn-server-common to yarn-server-resourcemanager -- Key: YARN-645 URL: https://issues.apache.org/jira/browse/YARN-645 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-645.patch RMDelegationTokenSecretManager is specific to resource manager, should not belong to server-common -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-614) Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1
[ https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648197#comment-13648197 ] Bikas Saha commented on YARN-614: - Unfortunately that is not how it happens. The RECOVER event is enqueued but not sent since the event dispatcher starts after recovery is completed. So by the time RECOVER is reached, all the state has been populated. The catch is that currently recovery stores the attempt state before launching the attempt. At that time it does not have the container status because that is obtained when the attempt finishes. So I am of the opinion of leaving recovery for a different jira. Now for the patch itself. Minor style thing. All this code could go inside app.updateFailureCount() and let it do whatever it wants, because the app has access to all the data and more. That method can evolve separately without bloating the transition method. We need to check whether the justFinished containers would always have an entry for the master container, especially the case when the node is lost because it went down.
{code}
+    // If the failure was not the AM's fault (e.g. node lost, or disk
+    // failure), then increment ignored failures, so we don't count the
+    // failure when determining whether to restart the app or not.
+    RMAppAttempt appMasterAttempt = app.attempts.get(app.currentAttempt
+        .getAppAttemptId());
+    Container appMasterContainer = appMasterAttempt.getMasterContainer();
+    ContainerStatus status = appMasterAttempt.getJustFinishedContainers()
+        .get(appMasterContainer.getId());
+
+    app.updateFailureCount(status.getExitStatus());
{code}
I am assuming aborted implies node lost in the patch. We need to make sure that aborted is not being used as a generic catch-all. Else we may need to add a new specific exit status NODE_LOST for the specific case.
{code}
+  private boolean shouldCountFailureToAttemptLimit(int masterContainerExitStatus) {
+    return masterContainerExitStatus != ContainerExitStatus.DISKS_FAILED
+        && masterContainerExitStatus != ContainerExitStatus.ABORTED;
+  }
{code}
I am not in favor of changing the List to a Map. The search is performed only once, at the end of the life of the attempt, and only if it has failed. So I am not sure perf is an issue here if we iterate once through this list. A List is cheaper wrt memory and also maintains the order of completion of containers as received by the RM. It's cheap for the ApplicationMasterService to pull when it populates the allocate response. This code probably won't compile because ApplicationMasterService expects a list and not a map.
{code}
-  private final List<ContainerStatus> justFinishedContainers =
-      new ArrayList<ContainerStatus>();
+  private final Map<ContainerId, ContainerStatus> justFinishedContainers =
+      new HashMap<ContainerId, ContainerStatus>();
{code}
Not quite sure why this method needs to be public. If it's private then it need not be part of the RMApp interface, and thus MockAsm or MockRMApp need not change.
{code}
+  @Override
+  public int getIgnoredFailures() {
+    this.readLock.lock();
+
+    try {
+      return this.ignoredFailures;
+    } finally {
+      this.readLock.unlock();
+    }
+  }
{code}
Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1 -- Key: YARN-614 URL: https://issues.apache.org/jira/browse/YARN-614 Project: Hadoop YARN Issue Type: Improvement Reporter: Bikas Saha Attachments: YARN-614-0.patch, YARN-614-1.patch Attempts can fail due to a large number of user errors and they should not be retried unnecessarily.
The only reason YARN should retry an attempt is when the hardware fails or YARN has an error. NM failing, lost NM and NM disk errors are the hardware errors that come to mind. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-513) Create common proxy client for communicating with RM
[ https://issues.apache.org/jira/browse/YARN-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648204#comment-13648204 ] Vinod Kumar Vavilapalli commented on YARN-513: -- While Bikas continues to review the patch, I just wanted to say that the patch *and* the code overall are so much cleaner now, thanks! Will look at the patch again once these comments are addressed. Create common proxy client for communicating with RM Key: YARN-513 URL: https://issues.apache.org/jira/browse/YARN-513 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-513.1.patch, YARN-513.2.patch, YARN-513.3.patch, YARN-513.4.patch, YARN.513.5.patch, YARN-513.6.patch When the RM is restarting, the NM, AM and Clients should wait for some time for the RM to come back up. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-645) Move RMDelegationTokenSecretManager from yarn-server-common to yarn-server-resourcemanager
[ https://issues.apache.org/jira/browse/YARN-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-645: - Attachment: YARN-645.1.patch Move RMDelegationTokenSecretManager from yarn-server-common to yarn-server-resourcemanager -- Key: YARN-645 URL: https://issues.apache.org/jira/browse/YARN-645 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-645.1.patch, YARN-645.patch RMDelegationTokenSecretManager is specific to resource manager, should not belong to server-common -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira