Re: Pulling data from a secured SQL database

2015-10-31 Thread Michael Armbrust
I would try using the JDBC Data Source and save the data to Parquet.
You can then put that data on your Spark cluster (probably install HDFS).
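
A minimal PySpark sketch of that suggestion, not a tested recipe. It assumes
Spark 2.0 or later (SparkSession; on Spark 1.x the same read/write calls hang
off sqlContext), the Microsoft JDBC driver jar on the classpath, and local
mode on the machine that is allowed to reach the database. The helper names,
host, database, table and output path are all hypothetical placeholders.

```python
def jdbc_options(host, database, table, port=1433):
    """Build options for Spark's JDBC data source (all values are placeholders)."""
    url = "jdbc:sqlserver://{}:{};databaseName={};integratedSecurity=true".format(
        host, port, database
    )
    return {
        "url": url,
        "dbtable": table,
        # Driver class shipped in Microsoft's JDBC driver jar.
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }


def export_table_to_parquet(host, database, table, out_path):
    """Read one table over JDBC and save it as Parquet."""
    # Imported here so jdbc_options() can be used without Spark installed.
    from pyspark.sql import SparkSession

    # Assumption: local mode, so every JDBC connection originates from the
    # one machine that has database access.
    spark = (
        SparkSession.builder.master("local[*]").appName("jdbc-export").getOrCreate()
    )
    try:
        options = jdbc_options(host, database, table)
        df = spark.read.format("jdbc").options(**options).load()
        df.write.mode("overwrite").parquet(out_path)
    finally:
        spark.stop()
```

Once the Parquet files have been copied to storage the cluster can see, a job
there can load them with spark.read.parquet(path).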

On Fri, Oct 30, 2015 at 6:49 PM, Thomas Ginter wrote:

> I am working in an environment where data is stored in MS SQL Server.  It
> has been secured so that only a specific set of machines can access the
> database through an integrated security Microsoft JDBC connection.  We also
> have a couple of beefy linux machines we can use to host a Spark cluster
> but those machines do not have access to the databases directly.  How can I
> pull the data from the SQL database on the smaller development machine and
> then have it distribute to the Spark cluster for processing?  Can the
> driver pull data and then distribute execution?
>
> Thanks,
>
> Thomas Ginter
> 801-448-7676
> thomas.gin...@utah.edu
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Pulling data from a secured SQL database

2015-10-31 Thread Deenar Toraskar
Thomas

I have the same problem, though in my case Kerberos authentication to
MS SQL Server from the cluster nodes does not seem to be supported. A
couple of options come to mind:

1) Pull the data by running Sqoop in local mode on the smaller
development machines and write it to HDFS or to another persistent store
connected to your Spark cluster.
2) Run Spark in local mode on the smaller development machines and use
the JDBC Data Source to read the data and write it to the same kind of
store.

Regards
Deenar

*Think Reactive Ltd*
deenar.toras...@thinkreactive.co.uk
07714140812




On 31 October 2015 at 11:35, Michael Armbrust wrote:

> I would try using the JDBC Data Source and save the data to Parquet.
> You can then put that data on your Spark cluster (probably install HDFS).


RE: Pulling data from a secured SQL database

2015-10-30 Thread Young, Matthew T
> Can the driver pull data and then distribute execution?



Yes, as long as your dataset fits in the driver's memory. Run whatever code
you would normally write in a single-node application to read the data on the
driver. Once the data is in a collection in the driver's memory, you can call
sc.parallelize(data) <http://spark.apache.org/docs/latest/programming-guide.html#parallelized-collections>
to distribute it to the workers for parallel processing as an RDD. You can
then convert it to a DataFrame if that is more appropriate for your workflow.
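
A rough Python sketch of that driver-side approach, under stated assumptions:
fetch_rows and distribute are hypothetical helper names, the connection is any
DB-API 2.0 connection opened on the driver machine, and sc is an existing
SparkContext. Only sc.parallelize comes from the reply above.

```python
def fetch_rows(connection, query):
    """Run a query through a DB-API 2.0 connection and return all rows as a list.

    Everything is held in driver memory, so this only suits data that fits there.
    """
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        cursor.close()


def distribute(sc, rows, num_slices=None):
    """Ship driver-side rows to the workers as an RDD.

    sc is a SparkContext; num_slices optionally sets the number of partitions.
    """
    return sc.parallelize(rows, num_slices)
```

With an RDD of tuples in hand, spark.createDataFrame(rdd, ["col_a", "col_b"])
is one way to get a DataFrame; the column names here are placeholders.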





-Original Message-
From: Thomas Ginter [mailto:thomas.gin...@utah.edu]
Sent: Friday, October 30, 2015 10:49 AM
To: user@spark.apache.org
Subject: Pulling data from a secured SQL database



I am working in an environment where data is stored in MS SQL Server.  It has 
been secured so that only a specific set of machines can access the database 
through an integrated security Microsoft JDBC connection.  We also have a 
couple of beefy linux machines we can use to host a Spark cluster but those 
machines do not have access to the databases directly.  How can I pull the data 
from the SQL database on the smaller development machine and then have it 
distribute to the Spark cluster for processing?  Can the driver pull data and 
then distribute execution?



Thanks,

Thomas Ginter
801-448-7676
thomas.gin...@utah.edu




Pulling data from a secured SQL database

2015-10-30 Thread Thomas Ginter
I am working in an environment where data is stored in MS SQL Server.  It has 
been secured so that only a specific set of machines can access the database 
through an integrated security Microsoft JDBC connection.  We also have a 
couple of beefy linux machines we can use to host a Spark cluster but those 
machines do not have access to the databases directly.  How can I pull the data 
from the SQL database on the smaller development machine and then have it 
distribute to the Spark cluster for processing?  Can the driver pull data and 
then distribute execution?

Thanks,

Thomas Ginter
801-448-7676
thomas.gin...@utah.edu




