> Can the driver pull data and then distribute execution?


Yes, as long as your dataset fits in the driver's memory. Execute arbitrary 
code to read the data on the driver just as you would in a single-node 
application. Once the data is in a collection in the driver's memory, you can 
call sc.parallelize(data) 
<http://spark.apache.org/docs/latest/programming-guide.html#parallelized-collections> 
to distribute it to the workers for parallel processing as an RDD. You can 
then convert it to a DataFrame if that is more appropriate for your workflow.
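As a rough sketch (the JDBC URL, query, and column layout below are 
hypothetical placeholders, not details from your environment; `sc` and 
`sqlContext` are the ones provided by spark-shell), the driver-side 
read-then-parallelize pattern looks something like this in Scala:

```scala
import java.sql.DriverManager
import scala.collection.mutable.ArrayBuffer

// This part runs only on the driver, i.e. the machine that is allowed
// to reach SQL Server. Integrated security requires the Microsoft JDBC
// driver (sqljdbc) and its native auth DLL on the driver's classpath.
val jdbcUrl =
  "jdbc:sqlserver://dbhost;databaseName=mydb;integratedSecurity=true"

val rows = new ArrayBuffer[(Int, String)]()
val conn = DriverManager.getConnection(jdbcUrl)
try {
  val rs = conn.createStatement().executeQuery(
    "SELECT id, name FROM some_table")  // placeholder query
  while (rs.next()) {
    rows += ((rs.getInt("id"), rs.getString("name")))
  }
} finally {
  conn.close()
}

// Ship the in-memory collection out to the executors as an RDD...
val rdd = sc.parallelize(rows)

// ...and optionally convert to a DataFrame.
import sqlContext.implicits._
val df = rdd.toDF("id", "name")
```

Keep in mind the whole result set is materialized on the driver before it is 
distributed, so size the driver heap accordingly (e.g. --driver-memory).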





-----Original Message-----
From: Thomas Ginter [mailto:thomas.gin...@utah.edu]
Sent: Friday, October 30, 2015 10:49 AM
To: user@spark.apache.org
Subject: Pulling data from a secured SQL database



I am working in an environment where data is stored in MS SQL Server.  It has 
been secured so that only a specific set of machines can access the database 
through an integrated-security Microsoft JDBC connection.  We also have a 
couple of beefy Linux machines we can use to host a Spark cluster, but those 
machines do not have direct access to the databases.  How can I pull the data 
from the SQL database on the smaller development machine and then have it 
distributed to the Spark cluster for processing?  Can the driver pull data and 
then distribute execution?



Thanks,



Thomas Ginter

801-448-7676

thomas.gin...@utah.edu
---------------------------------------------------------------------

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

