Hi

I am working on running a long-lived app on a secure YARN cluster. After
some reading in this domain, I want to make sure my understanding of the
life cycle of an app on Kerberos-enabled YARN is correct, as outlined below.

1. The client runs kinit to log in to the KDC, obtains HDFS delegation
tokens, and adds them to the container launch context before submitting the
application.
2. Once resources are allocated for the Application Master, the Node
Manager localizes app resources from HDFS using the HDFS delegation tokens
in the launch context. The same rule applies to Container localization too.
3. If the app is expected to run continuously for more than 7 days (the
default token max lifetime), then the application developer needs to
implement a way to renew and recreate HDFS delegation tokens and distribute
them among Containers. Some strategies can be found here, as per Steve:
https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/sections/yarn.html
4. When a Container or the Application Master fails after the token max
lifetime elapses, the original HDFS delegation token stored in the launch
context will be invalid no matter what. To mitigate this problem, users can
set up the RM as a proxy user to renew HDFS delegation tokens on behalf of
the user, as per https://issues.apache.org/jira/browse/YARN-2704. With this
in place, the RM will periodically renew and recreate HDFS delegation
tokens and update the launch context as well as all NMs running Containers.
5. Assuming 4 is working, technically and theoretically users can still
keep their app running beyond 7 days even without implementing 3, if my
understanding is correct. The reason is that when Containers or the AM fail
because the original HDFS delegation tokens have become invalid, the
restarted Containers or AM will pick up the valid, renewed or recreated
HDFS delegation tokens from the launch context. Of course, this is not
scalable and only alleviates the problem a bit.
6. There seem to be some issues when applying (1-4) in an HA Hadoop cluster
(for example, https://issues.apache.org/jira/browse/HDFS-9276), so I assume
this does not work for HA Hadoop clusters.
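To make sure I am reading 3 and 4 right, here is a minimal sketch of how I picture the timing constraints. I am assuming the defaults of dfs.namenode.delegation.token.renew-interval (24 hours) and dfs.namenode.delegation.token.max-lifetime (7 days), and next_action is just an illustrative name, not a real API:

```python
from datetime import datetime, timedelta

# Assumed HDFS defaults (configurable in hdfs-site.xml):
#   dfs.namenode.delegation.token.renew-interval = 24 hours
#   dfs.namenode.delegation.token.max-lifetime   = 7 days
RENEW_INTERVAL = timedelta(hours=24)
MAX_LIFETIME = timedelta(days=7)

def next_action(issued_at, now):
    """Decide what a long-lived app must do about an HDFS delegation token.

    A delegation token expires RENEW_INTERVAL after its last renewal, and
    renewal can only extend it up to MAX_LIFETIME past the original issue
    time. Beyond that point, a brand-new token has to be fetched with real
    Kerberos credentials (kinit/keytab) and redistributed to Containers.
    """
    age = now - issued_at
    if age < MAX_LIFETIME:
        return "renew"      # a renewer (e.g. the RM) can still extend the token
    return "recreate"       # past max lifetime: log in again and fetch a new token
```

So within the first 7 days any designated renewer can keep the token alive, but after that only a fresh login can produce a usable token, which is why 3 (or the YARN-2704 behavior in 4) becomes necessary.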

It would be great if someone with insight into this could let me know
whether my understanding is correct.

-- 
Chen Song
