On 13 Oct 2016, at 10:50, dbolshak
We have a challenge and no idea how to solve it.
Say we have the following environment:
1. `cluster A`, the cluster does not use kerberos and we use it as a source
of data, important thing is - we don't manage this cluster.
2. `cluster B`, small cluster where our spark application is running and
performing some logic (we manage this cluster and it does not have kerberos
turned on).
3. `cluster C`, the cluster uses kerberos and we use it to keep results of
our spark application, we manage this cluster
Our requirements and conditions that are not mentioned yet:
1. All clusters are in a single data center, but in the different
2. We cannot turn on kerberos on `cluster A`
3. We cannot turn off kerberos on `cluster C`
4. We can turn on/off kerberos on `cluster B`, currently it's turned off.
5. Spark app is built on top of RDD and does not depend on spark-sql.
Does anybody know how to write data using RDD api to remote cluster which is
running with Kerberos?
If you want to talk to the secure cluster, C, from code running in cluster B,
you'll need to turn kerberos on there. Maybe, just maybe, you could get away
with kerberos being turned off in B, provided that you, the user, launch the
application while logged in to kerberos yourself and so are trusted by cluster C.
One of the problems you are likely to hit with Spark here is that it's only
going to collect the tokens you need to talk to HDFS at the time you launch the
application, and by default, it only knows about the cluster FS. You will need
to tell spark about the other filesystem at launch time, so it will know to
authenticate with it as you, then collect the tokens needed for the application
itself to work with kerberos.
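
As a rough sketch of what "telling spark about the other filesystem at launch time" could look like: assuming the application is submitted to YARN on a Spark 1.x/2.x release, the `spark.yarn.access.namenodes` property lists extra namenodes to collect tokens from (later releases renamed this property, so check the docs for your version). The helper below only assembles the `spark-submit` command line. The namenode host, the jar name and the helper function itself are made up for illustration, and you would still need to have run `kinit` (or pass a principal and keytab) before launching.

```python
import shlex


def build_spark_submit(app, secure_namenodes, principal=None, keytab=None):
    """Assemble a spark-submit argv that names extra HDFS namenodes.

    Assumption: YARN deployment on Spark 1.x/2.x, where the property is
    spark.yarn.access.namenodes. This helper is hypothetical, not a Spark API.
    """
    argv = [
        "spark-submit",
        "--master", "yarn",
        # Comma-separated list of the remote (kerberized) filesystems.
        "--conf", "spark.yarn.access.namenodes=" + ",".join(secure_namenodes),
    ]
    # Optional: let Spark log in from a keytab instead of relying on kinit.
    if principal and keytab:
        argv += ["--principal", principal, "--keytab", keytab]
    argv.append(app)
    return argv


# Hypothetical namenode address for cluster C and a hypothetical application jar.
cmd = build_spark_submit(
    "my-app.jar",
    ["hdfs://cluster-c-nn.example.com:8020"],
)
print(" ".join(shlex.quote(part) for part in cmd))
```

Inside the application, the RDD output path would then point at the same cluster C namenode URI, so the tokens picked up at launch get used for the write.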