(Sorry, sent the reply via the wrong account.)
Kinda hijacking the thread, but I promise it’s still on topic to OP’s issue.. ;-)
Usually you will end up having a local Kerberos set up per cluster.
So your machine accounts (hive, yarn, hbase, etc …) are going to be local to
So you will have to set up some sort of realm trusts between the clusters.
If you’re going to be setting up security (Kerberos … ick! shivers… ;-) you’re
going to want to keep the machine accounts isolated to the cluster.
And the OP said that he didn’t control the other cluster which makes me believe
that they are separate.
I would also think that you would have trouble with the credential… isn’t it
tied to a user at a specific machine?
(It’s been a while since I looked at this and I drank heavily to forget
Kerberos… so I may be a bit fuzzy here.)
On Oct 18, 2016, at 2:59 PM, Steve Loughran
On 17 Oct 2016, at 22:11, Michael Segel
@Steve you are going to have to explain what you mean by ‘turn Kerberos on’.
Taken one way… it could mean making cluster B secure and running Kerberos and
then you’d have to create some sort of trust between B and C,
I'd imagined making cluster B a kerberized cluster.
I don't think you need to go near trust relations though. Ideally you'd just
want the same accounts everywhere if you can; if not, the main thing is that
the user submitting the job can get a credential for that far NN at job
submission time, and that the credential is propagated all the way to the executors.
Did you mean turn on Kerberos on the nodes in Cluster B so that each node
becomes a trusted client that can connect to C?
Did you mean to turn on Kerberos on the master node (e.g. edge node) where the
data persists if you collect() it, so it's off the cluster and on a single machine,
and then push it from there so that only that machine has to have Kerberos
running and is a trusted server to Cluster C?
Note: In option 3, I hope I said it correctly, but I believe that you would be
collecting the data to a client (edge node) before pushing it out to the
Does that make sense?
On Oct 14, 2016, at 1:32 PM, Steve Loughran
On 13 Oct 2016, at 10:50, dbolshak
We have a challenge and no idea how to solve it.
Say we have the following environment:
1. `cluster A`: this cluster does not use kerberos and we use it as a source
of data. Importantly, we don't manage this cluster.
2. `cluster B`, small cluster where our spark application is running and
performing some logic. (we manage this cluster and it does not have
3. `cluster C`, the cluster uses kerberos and we use it to keep results of
our spark application, we manage this cluster
Our requirements and conditions that are not mentioned yet:
1. All clusters are in a single data center, but in the different
2. We cannot turn on kerberos on `cluster A`
3. We cannot turn off kerberos on `cluster C`
4. We can turn on/off kerberos on `cluster B`, currently it's turned off.
5. Spark app is built on top of RDD and does not depend on spark-sql.
Does anybody know how to write data using the RDD API to a remote cluster that is
running with Kerberos?
If you want to talk to the secure cluster, C, from code running in cluster B,
you'll need to turn kerberos on there. Maybe, maybe, you could just get away
with kerberos being turned off, but with you, the user, launching the application
while logged in to kerberos yourself and so trusted by Cluster C.
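A quick way to confirm that you are "logged in to kerberos yourself" before launching is to ask `klist` whether the credential cache holds a usable ticket. This is only a minimal sketch: it assumes MIT Kerberos's `klist -s` (silent mode, exit status 0 only when the cache is readable and unexpired), and the function name `has_valid_ticket` is made up for illustration.

```python
import subprocess


def has_valid_ticket(klist: str = "klist") -> bool:
    """Return True if `klist -s` reports a readable, unexpired credential cache."""
    try:
        # -s: print nothing and signal the result through the exit status only.
        result = subprocess.run([klist, "-s"], check=False)
    except OSError:
        # klist is missing or not executable, so treat that as "no ticket".
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if not has_valid_ticket():
        print("No valid Kerberos ticket; run kinit before spark-submit.")
```

A valid local ticket only tells you that you can authenticate; whether Cluster C accepts that principal still depends on how its realm is set up.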
One of the problems you are likely to hit with Spark here is that it's only
going to collect the tokens you need to talk to HDFS at the time you launch the
application, and by default, it only knows about the cluster FS. You will need
to tell spark about the other filesystem at launch time, so it will know to
authenticate with it as you, then collect the tokens needed for the application
itself to work with kerberos.
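To make "tell spark about the other filesystem at launch time" concrete, here is a hedged sketch that assembles a `spark-submit` command line for a YARN deployment. It assumes the `spark.yarn.access.namenodes` property from the Spark 1.x/2.0 "Running on YARN" docs (later releases renamed it, so check the docs for your version). The helper name `build_spark_submit_args`, the host `nn-c.example.com`, the principal, and the keytab path are all placeholders, not values from this thread.

```python
from typing import List, Sequence


def build_spark_submit_args(
    app: str,
    remote_filesystems: Sequence[str],
    principal: str = "",
    keytab: str = "",
) -> List[str]:
    """Assemble a spark-submit command that names extra secure filesystems."""
    args = ["spark-submit", "--master", "yarn"]
    if principal and keytab:
        # Optional: let Spark log in from a keytab instead of the kinit cache.
        args += ["--principal", principal, "--keytab", keytab]
    if remote_filesystems:
        # Comma-separated list of filesystems to fetch delegation tokens for
        # at submission time, in addition to the cluster's default FS.
        args += [
            "--conf",
            "spark.yarn.access.namenodes=" + ",".join(remote_filesystems),
        ]
    args.append(app)
    return args


if __name__ == "__main__":
    cmd = build_spark_submit_args(
        app="my_app.py",  # placeholder application
        remote_filesystems=["hdfs://nn-c.example.com:8020"],  # placeholder NN of cluster C
        principal="user@EXAMPLE.COM",  # placeholder principal
        keytab="/path/to/user.keytab",  # placeholder keytab path
    )
    print(" ".join(cmd))
```

The application would then write with a fully qualified URI on the remote cluster (an `hdfs://` path naming cluster C's namenode) rather than a path relative to cluster B's default filesystem.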