Hi, I am working on running a long-lived app on a secure YARN cluster. After some reading in this domain, I want to make sure my understanding of the life cycle of an app on Kerberos-enabled YARN is correct, as below.
1. The client runs kinit to log in to the KDC, obtains HDFS delegation tokens, and adds them to the application's launch context before submitting the application.

2. Once resources are allocated for the ApplicationMaster, the NodeManager localizes the app's resources from HDFS using the HDFS delegation tokens in the launch context. The same rule applies to container localization.

3. If the app is expected to run continuously beyond the token max lifetime (7 days by default), the application developer needs to build a way to renew and recreate HDFS delegation tokens and distribute them among the containers. Some strategies can be found here, per Steve: https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/sections/yarn.html

4. If a container or the ApplicationMaster fails after the token max lifetime has elapsed, the original HDFS delegation token stored in the launch context will be invalid no matter what. To mitigate this, users can set up the RM as a proxy user so that it can renew and recreate HDFS delegation tokens on behalf of the user, per https://issues.apache.org/jira/browse/YARN-2704. With this applied, the RM will periodically renew and recreate HDFS delegation tokens and update the launch context as well as all NMs running containers.

5. Assuming 4 works, then technically and theoretically users can keep their app running beyond 7 days even without implementing 3, if my understanding is correct. The reason is that once containers or the AM fail because the original HDFS delegation tokens have become invalid, the restarted containers or AM will pick up the valid renewed or recreated HDFS delegation tokens from the launch context. Of course, this is not scalable and only alleviates the problem a bit.

6. There seem to be some issues when applying (1-4) on an HA Hadoop cluster (for example, https://issues.apache.org/jira/browse/HDFS-9276), so I assume this does not work for HA Hadoop clusters.

It would be great if someone with insight into this could let me know whether my understanding is correct.

-- Chen Song
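For step 4, my understanding is that the YARN-2704 behavior is switched on via a property in yarn-site.xml, combined with the usual proxy-user entries on the HDFS side. A sketch of what I believe the config looks like, assuming the RM runs as the service user "yarn" on a host named rm-host.example.com (both names are placeholders for illustration):

```xml
<!-- yarn-site.xml: allow the RM to obtain new tokens on behalf of the
     submitting user once the originals can no longer be renewed (YARN-2704) -->
<property>
  <name>yarn.resourcemanager.proxy-user-privileges.enabled</name>
  <value>true</value>
</property>

<!-- core-site.xml (NameNode side): trust the RM's principal as a proxy user.
     "yarn" and the host below are assumptions; adjust to your deployment. -->
<property>
  <name>hadoop.proxyuser.yarn.hosts</name>
  <value>rm-host.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.yarn.groups</name>
  <value>*</value>
</property>
```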

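P.S. To make the two clocks in steps 3-5 concrete, here is a small self-contained sketch of the arithmetic as I understand it, using the stock HDFS defaults dfs.namenode.delegation.token.renew-interval (24 hours) and dfs.namenode.delegation.token.max-lifetime (7 days). The class and method names are mine, purely for illustration; this is not a Hadoop API.

```java
public class TokenLifetimeSketch {
    // Default dfs.namenode.delegation.token.renew-interval: 24 hours.
    static final long RENEW_INTERVAL_MS = 24L * 60 * 60 * 1000;
    // Default dfs.namenode.delegation.token.max-lifetime: 7 days.
    static final long MAX_LIFETIME_MS = 7L * 24 * 60 * 60 * 1000;

    /** Latest expiry a renewer can push the token to at time 'now':
     *  one renew interval ahead, but never past the hard max lifetime. */
    static long renewedExpiry(long issuedAt, long now) {
        return Math.min(now + RENEW_INTERVAL_MS, issuedAt + MAX_LIFETIME_MS);
    }

    /** Once the max lifetime has elapsed, no renewal can help;
     *  a brand-new token must be fetched and redistributed (step 3/4). */
    static boolean mustRecreate(long issuedAt, long now) {
        return now >= issuedAt + MAX_LIFETIME_MS;
    }

    public static void main(String[] args) {
        long issued = 0;
        long day = RENEW_INTERVAL_MS;
        // Day 3: renewal still works, expiry is pushed out to day 4.
        System.out.println(renewedExpiry(issued, 3 * day) == 4 * day);
        // Day 8: past the max lifetime, so only a newly issued token will do.
        System.out.println(mustRecreate(issued, 8 * day));
    }
}
```

In other words, renewal only stretches a token's expiry out to the 7-day ceiling; past that ceiling the token must be recreated, which is exactly why a restarted AM or container needs step 4's refreshed tokens in the launch context.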