Re: How to authenticate to ADLS from within spark job on the fly

2017-08-19 Thread Imtiaz Ahmed
Hi Steve,

I don't think I fully understand your answer, so please pardon my naivety
on the subject. From what I understand, the actual read will happen in the
executors, so the executors need access to the data lake. Given that, how do
I programmatically pass Azure credentials to the executors so that they can
read the data they need to process?
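Concretely, is something like the following the right approach? This is only a
rough, untested sketch: MyLibrary/MyCredentials are the hypothetical API from my
first mail (the accessor names are made up), and I'm assuming the OAuth2
properties of hadoop-azure-datalake (dfs.adls.oauth2.* in Hadoop 2.8,
fs.adl.oauth2.* in later releases).

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("adls-example").getOrCreate();
Configuration hadoopConf = spark.sparkContext().hadoopConfiguration();

// hypothetical helper from my first mail; userId is whatever identifies the user
MyCredentials creds = MyLibrary.getCredentialsForPath(userId,
    "/some/path/on/azure/datalake");

// assumed property names from hadoop-azure-datalake (they vary by Hadoop version)
hadoopConf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential");
hadoopConf.set("dfs.adls.oauth2.client.id", creds.getClientId());
hadoopConf.set("dfs.adls.oauth2.credential", creds.getClientSecret());
hadoopConf.set("dfs.adls.oauth2.refresh.url", creds.getTokenRefreshUrl());

// my understanding is that this configuration is shipped with the job, so the
// executors would pick these settings up when they open the adl:// path
Dataset<Row> people = spark.read().json("adl://examples/src/main/resources/people.json");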

Another dilemma I have is that a user might access more than one data set
within a job, let's say to join the two. In that case, I might have two
separate access tokens for reading data from the data lake, one for each
data set.

Does that make sense?

Imtiaz

On Sat, Aug 19, 2017 at 7:04 AM, Steve Loughran wrote:

>
> On 19 Aug 2017, at 02:42, Imtiaz Ahmed  wrote:
>
> Hi All,
>
> I am building a spark library which developers will use when writing their
> spark jobs to get access to data on Azure Data Lake. But the authentication
> will depend on the dataset they ask for. I need to call a REST API from
> within the spark job to get credentials and authenticate to read data from
> ADLS. Is that even possible? I am new to spark.
> E.g., from inside a spark job a user will say:
>
> MyCredentials myCredentials = MyLibrary.getCredentialsForPath(userId,
> "/some/path/on/azure/datalake");
>
> then before spark.read.json("adl://examples/src/main/resources/people.json")
> I need to authenticate the user to be able to read that path using the
> credentials fetched above.
>
> Any help is appreciated.
>
> Thanks,
> Imtiaz
>
>
> The ADL filesystem supports addDelegationTokens(), allowing the caller to
> collect the delegation tokens of the current authenticated user & then pass
> them along with the request —which is exactly what spark should be doing in
> spark submit.
>
> if you want to do it yourself, look in SparkHadoopUtils (I think; IDE is
> closed right now) & see how the tokens are picked up and then passed around
> (marshalled over the job request, unmarshalled after & picked up, with bits
> of the UserGroupInformation class doing the low level work)
>
> Java code snippet to write to the path tokenFile:
>
> FileSystem fs = FileSystem.get(conf);
> Credentials cred = new Credentials();
> Token<?>[] tokens = fs.addDelegationTokens(renewer, cred);
> cred.writeTokenStorageFile(tokenFile, conf);
>
> you can then read that file back in elsewhere, and then (somehow) get the FS to
> use those tokens
>
> otherwise, ADL supports OAuth, so you may be able to use any OAuth
> libraries for this. hadoop-azure-datalake pulls in okhttp for that:
>
> <dependency>
>   <groupId>com.squareup.okhttp</groupId>
>   <artifactId>okhttp</artifactId>
>   <version>2.4.0</version>
> </dependency>
>
> -Steve
>
>


Re: How to authenticate to ADLS from within spark job on the fly

2017-08-19 Thread Patrick Alwell
This might help; I’ve built a REST API with a Livy server:
https://livy.incubator.apache.org/
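Submitting a batch job through Livy is just an HTTP POST against its /batches
endpoint. Here's a rough, untested sketch using the okhttp client Steve mentions
below; the Livy host/port, jar path and class name are placeholders:

import com.squareup.okhttp.MediaType;
import com.squareup.okhttp.OkHttpClient;
import com.squareup.okhttp.Request;
import com.squareup.okhttp.RequestBody;
import com.squareup.okhttp.Response;

public class LivyBatchSubmit {
  public static void main(String[] args) throws Exception {
    OkHttpClient client = new OkHttpClient();
    // Livy's batch API takes a JSON body; "file" points at the job jar (placeholder path)
    String json = "{\"file\": \"/path/to/my-spark-job.jar\", "
        + "\"className\": \"com.example.MyJob\"}";
    Request request = new Request.Builder()
        .url("http://livy-host:8998/batches")   // placeholder host and port
        .post(RequestBody.create(MediaType.parse("application/json"), json))
        .build();
    Response response = client.newCall(request).execute();
    System.out.println(response.body().string()); // returns the batch id and state
  }
}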



From: Steve Loughran 
Date: Saturday, August 19, 2017 at 7:05 AM
To: Imtiaz Ahmed 
Cc: "user@spark.apache.org" 
Subject: Re: How to authenticate to ADLS from within spark job on the fly


On 19 Aug 2017, at 02:42, Imtiaz Ahmed wrote:

Hi All,
I am building a spark library which developers will use when writing their
spark jobs to get access to data on Azure Data Lake. But the authentication
will depend on the dataset they ask for. I need to call a REST API from within
the spark job to get credentials and authenticate to read data from ADLS. Is
that even possible? I am new to spark.
E.g., from inside a spark job a user will say:

MyCredentials myCredentials = MyLibrary.getCredentialsForPath(userId, 
"/some/path/on/azure/datalake");

then before spark.read.json("adl://examples/src/main/resources/people.json")
I need to authenticate the user to be able to read that path using the 
credentials fetched above.

Any help is appreciated.

Thanks,
Imtiaz

The ADL filesystem supports addDelegationTokens(), allowing the caller to
collect the delegation tokens of the current authenticated user & then pass
them along with the request —which is exactly what spark should be doing in
spark submit.

if you want to do it yourself, look in SparkHadoopUtils (I think; IDE is closed 
right now) & see how the tokens are picked up and then passed around 
(marshalled over the job request, unmarshalled after & picked up, with bits of 
the UserGroupInformation class doing the low level work)

Java code snippet to write to the path tokenFile:

FileSystem fs = FileSystem.get(conf);
Credentials cred = new Credentials();
Token<?>[] tokens = fs.addDelegationTokens(renewer, cred);
cred.writeTokenStorageFile(tokenFile, conf);

you can then read that file back in elsewhere, and then (somehow) get the FS to
use those tokens

otherwise, ADL supports OAuth, so you may be able to use any OAuth libraries
for this. hadoop-azure-datalake pulls in okhttp for that:

<dependency>
  <groupId>com.squareup.okhttp</groupId>
  <artifactId>okhttp</artifactId>
  <version>2.4.0</version>
</dependency>

-Steve



Re: How to authenticate to ADLS from within spark job on the fly

2017-08-19 Thread Steve Loughran

On 19 Aug 2017, at 02:42, Imtiaz Ahmed wrote:


Hi All,

I am building a spark library which developers will use when writing their
spark jobs to get access to data on Azure Data Lake. But the authentication
will depend on the dataset they ask for. I need to call a REST API from within
the spark job to get credentials and authenticate to read data from ADLS. Is
that even possible? I am new to spark.

E.g., from inside a spark job a user will say:

MyCredentials myCredentials = MyLibrary.getCredentialsForPath(userId, 
"/some/path/on/azure/datalake");

then before spark.read.json("adl://examples/src/main/resources/people.json")
I need to authenticate the user to be able to read that path using the 
credentials fetched above.

Any help is appreciated.

Thanks,
Imtiaz

The ADL filesystem supports addDelegationTokens(), allowing the caller to
collect the delegation tokens of the current authenticated user & then pass
them along with the request —which is exactly what spark should be doing in
spark submit.

if you want to do it yourself, look in SparkHadoopUtils (I think; IDE is closed 
right now) & see how the tokens are picked up and then passed around 
(marshalled over the job request, unmarshalled after & picked up, with bits of 
the UserGroupInformation class doing the low level work)

Java code snippet to write to the path tokenFile:

FileSystem fs = FileSystem.get(conf);
Credentials cred = new Credentials();
Token<?>[] tokens = fs.addDelegationTokens(renewer, cred);
cred.writeTokenStorageFile(tokenFile, conf);

you can then read that file back in elsewhere, and then (somehow) get the FS to
use those tokens
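the read-back side would look something like this (again untested, but
Credentials.readTokenStorageFile() and UserGroupInformation.addCredentials() are
the relevant calls; the token file path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.UserGroupInformation;

Configuration conf = new Configuration();
Path tokenFile = new Path("tokens.bin");   // wherever the file was written

// read the tokens back in and attach them to the current user, so that
// FileSystem instances created afterwards can look them up
Credentials cred = Credentials.readTokenStorageFile(tokenFile, conf);
UserGroupInformation.getCurrentUser().addCredentials(cred);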

otherwise, ADL supports OAuth, so you may be able to use any OAuth libraries
for this. hadoop-azure-datalake pulls in okhttp for that:

<dependency>
  <groupId>com.squareup.okhttp</groupId>
  <artifactId>okhttp</artifactId>
  <version>2.4.0</version>
</dependency>
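and if you go the OAuth route yourself, the token request is a plain
client-credentials POST to the Azure AD token endpoint, something like this
(untested sketch; the tenant ID, client ID and secret are placeholders, and the
"resource" value is the one ADLS expects):

import com.squareup.okhttp.FormEncodingBuilder;
import com.squareup.okhttp.OkHttpClient;
import com.squareup.okhttp.Request;
import com.squareup.okhttp.RequestBody;
import com.squareup.okhttp.Response;

OkHttpClient client = new OkHttpClient();

// client_credentials grant; the JSON response carries an "access_token" field
RequestBody form = new FormEncodingBuilder()
    .add("grant_type", "client_credentials")
    .add("client_id", "<client-id>")
    .add("client_secret", "<client-secret>")
    .add("resource", "https://datalake.azure.net/")
    .build();
Request request = new Request.Builder()
    .url("https://login.microsoftonline.com/<tenant-id>/oauth2/token")
    .post(form)
    .build();
Response response = client.newCall(request).execute();
System.out.println(response.body().string());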


-Steve



Re: How to authenticate to ADLS from within spark job on the fly

2017-08-18 Thread ayan guha
It may not be as easy as you think. The REST call will happen in the driver, but
the reads will happen in the executors.

On Sat, 19 Aug 2017 at 11:42 am, Imtiaz Ahmed  wrote:

> Hi All,
>
> I am building a spark library which developers will use when writing their
> spark jobs to get access to data on Azure Data Lake. But the authentication
> will depend on the dataset they ask for. I need to call a REST API from
> within the spark job to get credentials and authenticate to read data from
> ADLS. Is that even possible? I am new to spark.
> E.g., from inside a spark job a user will say:
>
> MyCredentials myCredentials = MyLibrary.getCredentialsForPath(userId,
> "/some/path/on/azure/datalake");
>
> then before spark.read.json("adl://examples/src/main/resources/people.json")
> I need to authenticate the user to be able to read that path using the
> credentials fetched above.
>
> Any help is appreciated.
>
> Thanks,
> Imtiaz
>
-- 
Best Regards,
Ayan Guha