> Can the driver pull data and then distribute execution?
Yes, as long as your dataset fits in the driver's memory. Execute arbitrary code on the driver to read the data, exactly as you would when writing a single-node application. Once you have the data in a collection in the driver's memory, you can call sc.parallelize(data) <http://spark.apache.org/docs/latest/programming-guide.html#parallelized-collections> to distribute it to the workers for parallel processing as an RDD. You can then convert the RDD to a DataFrame if that is more appropriate for your workflow.

-----Original Message-----
From: Thomas Ginter [mailto:thomas.gin...@utah.edu]
Sent: Friday, October 30, 2015 10:49 AM
To: user@spark.apache.org
Subject: Pulling data from a secured SQL database

I am working in an environment where data is stored in MS SQL Server. It has been secured so that only a specific set of machines can access the database, through a Microsoft JDBC connection using integrated security. We also have a couple of beefy Linux machines we can use to host a Spark cluster, but those machines do not have direct access to the databases. How can I pull the data from the SQL database on the smaller development machine and then have it distributed to the Spark cluster for processing? Can the driver pull data and then distribute execution?

Thanks,

Thomas Ginter
801-448-7676
thomas.gin...@utah.edu
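A minimal sketch of that pattern, in Scala against the Spark 1.x API that was current for this thread. The JDBC URL, table, and column names are placeholders, not from the original messages; adjust them for your own database and schema:

```scala
import java.sql.DriverManager
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.{SparkConf, SparkContext}

object DriverSideLoad {
  def main(args: Array[String]): Unit = {
    // 1. Read the data on the driver, just like a single-node app.
    //    This code runs on the machine that can reach SQL Server.
    //    (Hypothetical connection string -- use your own host/db.)
    val conn = DriverManager.getConnection(
      "jdbc:sqlserver://dbhost;databaseName=mydb;integratedSecurity=true")
    val rows = ArrayBuffer[(Int, String)]()
    val rs = conn.createStatement()
      .executeQuery("SELECT id, name FROM mytable")
    while (rs.next()) rows += ((rs.getInt("id"), rs.getString("name")))
    conn.close()

    // 2. Distribute the in-memory collection to the workers as an RDD.
    val sc = new SparkContext(new SparkConf().setAppName("driver-side-load"))
    val rdd = sc.parallelize(rows)

    // 3. Optionally convert to a DataFrame for SQL-style processing
    //    (SQLContext in Spark 1.x).
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    val df = rdd.toDF("id", "name")
    df.show()
  }
}
```

Note that the whole dataset passes through (and must fit in) the driver's heap before `parallelize` ships it to the executors, so this only works for datasets well under the driver's memory limit.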